Research on BBS Content Supervision Technology Based on Active Search (基于主动搜索的论坛内容监管技术研究)

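English title: Research on BBS Content Supervision Technology Based on Active Search
Author: 耿乐群
Degree level: Master's
Discipline: Computer Application Technology
Chinese keywords: 内容监管; 论坛BBS; 主动模式; 网络爬虫; URL去重
English keywords: content supervision; forum BBS; active mode; web crawler; duplicate URL removal
Degree year: 2011
Advisor: 王巍
Discipline code: 081203
Degree-granting institution: Harbin Engineering University
Submission date: 2011-01-01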

Abstract

As Internet access becomes ever more widespread, the Internet is becoming an indispensable medium for disseminating information. At the same time, harmful online content such as reactionary and pornographic material is spreading with it, seriously affecting national stability and the physical and mental health of the public. Forums are a form of Internet application commonly used by netizens; while they bring convenience to users, they also face the problem of spreading harmful information. To maintain a healthy online cultural environment, monitoring forum content is therefore necessary.
     Forum content supervision can be implemented in two modes, active and passive. The active mode has advantages of its own. Addressing the problems that arise in the active mode, this thesis studies and implements solutions to the following two problems.
     In the active mode, a web crawler fetches forum pages to supply the raw content for supervision. For forums that require users to log in before content can be viewed, however, the crawler often obtains only the login page, which is useless for content supervision. To solve this problem, this thesis analyzes the user login process and its underlying principles in detail, and then designs and implements a scheme for obtaining restricted forum content that combines Cookies with a crawler: an authentication Cookie is acquired in a relatively automated way and is then used to fetch restricted forum pages. Experiments demonstrate the feasibility of the scheme.
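
The Cookie-based idea described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation, which is not shown in this record; it only assumes that a forum issues a session Cookie in response to a login form POST and expects that Cookie on later requests. The function names, the form fields, and the "password field means login page" heuristic are all hypothetical choices made for illustration.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def build_session():
    """Return an opener that stores and resends cookies, plus its cookie jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def login(opener, login_url, form_fields):
    """POST the login form; any Set-Cookie header in the reply lands in the jar."""
    body = urllib.parse.urlencode(form_fields).encode("utf-8")
    with opener.open(login_url, data=body, timeout=10) as resp:
        return resp.status


def fetch(opener, url, encoding="utf-8"):
    """GET a page; cookies already in the jar are attached automatically."""
    with opener.open(url, timeout=10) as resp:
        return resp.read().decode(encoding, errors="replace")


def is_login_page(html, marker='type="password"'):
    """Crude heuristic (an assumption): a password field suggests the login form."""
    return marker in html.lower()


def crawl_restricted(urls, fetch_page):
    """Fetch each URL and keep only pages that are not the login form."""
    pages = {}
    for url in urls:
        html = fetch_page(url)
        if not is_login_page(html):
            pages[url] = html
    return pages


# Hypothetical usage (URL and field names are placeholders, not from the thesis):
# opener, jar = build_session()
# login(opener, "http://forum.example/login.php", {"username": "u", "password": "p"})
# pages = crawl_restricted(["http://forum.example/thread-1.html"],
#                          lambda u: fetch(opener, u))
```

Passing the fetch function into `crawl_restricted` keeps the filtering logic independent of the network layer, so it can be exercised with canned pages before pointing it at a real forum.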
     While a web crawler is running, fast and efficient URL deduplication is needed to avoid downloading the same page repeatedly. Hash-based deduplication is an important research direction. This thesis studies a URL deduplication method based on the K-Picked hash algorithm. After examining the principles and shortcomings of the original algorithm, it improves and optimizes the algorithm by expanding the range of ordinary characters it uses, increasing the dispersion of the divisors, and randomizing the value of K, which lowers the collision rate of the final compressed encoding. A series of experiments verifies that the improved algorithm achieves relatively good results in URL deduplication.
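
As a rough illustration of hash-based URL deduplication in general, the Python sketch below stores a short fixed-size fingerprint of each URL instead of the URL itself. It does not implement the K-Picked algorithm or the thesis's improvements, whose details are not given in this record; BLAKE2b truncated to 8 bytes is used purely as a stand-in hash, and the class and function names are hypothetical.

```python
import hashlib


class UrlSeenSet:
    """Remembers URLs by a fixed-size fingerprint instead of the full string."""

    def __init__(self, fingerprint_bytes=8):
        # Shorter fingerprints save memory but raise the collision rate.
        self.fingerprint_bytes = fingerprint_bytes
        self._seen = set()

    def _fingerprint(self, url):
        # Stand-in hash (assumption): BLAKE2b with a truncated digest size.
        return hashlib.blake2b(
            url.encode("utf-8"), digest_size=self.fingerprint_bytes
        ).digest()

    def add_if_new(self, url):
        """Return True and record the URL if its fingerprint is unseen."""
        fp = self._fingerprint(url)
        if fp in self._seen:
            return False
        self._seen.add(fp)
        return True


def dedupe(urls, seen=None):
    """Return the URLs in order, dropping any whose fingerprint was already seen."""
    seen = seen if seen is not None else UrlSeenSet()
    return [u for u in urls if seen.add_if_new(u)]
```

With any scheme of this kind, two different URLs that collide on the same fingerprint cause the second one to be skipped as if it had already been crawled, which is why the collision rate of the compressed encoding matters.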

