Research on Web Page Filtering Based on the Vector Space Model
Abstract
With the rapid development of network information technology, users can access vast amounts of shared information over the network quickly and conveniently; at the same time, problems such as "information explosion", "information overload" and "information junk" are becoming increasingly serious. The volume of useless or harmful information far exceeds the volume of information we actually need, which causes great inconvenience. How to express user needs accurately, automatically select from large-scale information flows the information that satisfies those needs while filtering out useless and undesirable information, and thereby let people use information resources more effectively, has become a problem that urgently needs to be solved. Motivated by these problems, this thesis studies information filtering on a local area network. The proposed approach can not only filter undesirable web pages but also filter web pages according to topics of interest.
     This thesis reviews the state of the art in web page text filtering and the main information filtering methods, and discusses in detail the key techniques used in web page text filtering and their implementation. A multi-level filtering strategy is adopted: packets flowing through the gateway are first filtered by IP address and by keywords, and the thesis then focuses on DOM-tree-based extraction of the main text of a page and on content-based filtering. For main-text extraction, a DOM-tree-based method is implemented that lets users set parameters according to their needs and obtain the desired result, so that the extraction result does not vary with changes in page structure. Content-based filtering comprises two major parts: processing of network data and processing of web page text. For network data processing, the thesis discusses packet capture with WinPcap on Windows; by analyzing the TCP protocol, the IP protocol and HTTP messages, packets that do not carry text/html content are discarded, and a linked-list reassembly algorithm then reconstructs the web page from the remaining packets. In the keyword filtering stage, an improved multi-keyword matching algorithm based on protocol analysis is used to raise matching efficiency. For web page text processing, the thesis implements main-text extraction and improves the text representation: because web pages are a special kind of document, an improved vector space model is used to represent the text. Feature terms from the user template are extracted in turn and processed precisely at the positions where they occur in the page text, so the whole document does not have to be processed; especially when non-relevant documents outnumber relevant ones in the information flow, or when large texts are handled, this greatly reduces processing time and improves precision. Finally, the thesis describes learning of the user template: an improved Rocchio algorithm is used to update the template, which improves the precision of web page filtering.
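     The abstract does not spell out the improved term weighting or the improved Rocchio update. As a point of reference, the standard forms they build on can be written as follows; the tag/position factor \lambda_{tag} and the coefficients \alpha, \beta, \gamma are illustrative assumptions rather than the thesis's exact formulation:

         w(t, d) = \lambda_{tag}(t, d) \cdot tf(t, d) \cdot \log \frac{N}{df(t)}

         Q_{new} = \alpha Q_{old} + \beta \frac{1}{|D_R|} \sum_{d \in D_R} \vec{d} - \gamma \frac{1}{|D_N|} \sum_{d \in D_N} \vec{d}

     Here tf(t, d) is the frequency of term t in page d, df(t) its document frequency over N pages, \lambda_{tag} a weight depending on the HTML tag or position in which the term occurs, and D_R and D_N are the relevant and non-relevant pages fed back when the user template Q is updated.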
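     To make the content-based filtering decision concrete, the following is a minimal sketch in Java of comparing a user template with the extracted page text in the vector space model, using cosine similarity and a fixed threshold. Word segmentation (done with ICTCLAS in the thesis), the position-based weighting and the improved Rocchio update are omitted, and all names and values here are illustrative assumptions.

         import java.util.HashMap;
         import java.util.Map;

         // Sketch: represent template and page text as term vectors and compare them.
         public class VsmFilter {

             // Build a simple term-frequency vector from whitespace-separated terms.
             static Map<String, Double> termVector(String text) {
                 Map<String, Double> v = new HashMap<>();
                 for (String term : text.toLowerCase().split("\\s+")) {
                     if (!term.isEmpty()) {
                         v.merge(term, 1.0, Double::sum);
                     }
                 }
                 return v;
             }

             // Cosine similarity between two sparse term vectors.
             static double cosine(Map<String, Double> a, Map<String, Double> b) {
                 double dot = 0.0, na = 0.0, nb = 0.0;
                 for (Map.Entry<String, Double> e : a.entrySet()) {
                     Double w = b.get(e.getKey());
                     if (w != null) dot += e.getValue() * w;
                     na += e.getValue() * e.getValue();
                 }
                 for (double w : b.values()) nb += w * w;
                 return (na == 0 || nb == 0) ? 0.0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
             }

             public static void main(String[] args) {
                 // Hypothetical user template describing the topic of interest.
                 Map<String, Double> template = termVector("network security intrusion detection filtering");
                 // Hypothetical text obtained from DOM-based main-text extraction of a captured page.
                 Map<String, Double> page = termVector("a survey of intrusion detection and network filtering methods");

                 double threshold = 0.2; // assumed value; in practice tuned per template
                 double sim = cosine(template, page);
                 System.out.println("similarity = " + sim + " -> " + (sim >= threshold ? "keep page" : "filter page"));
             }
         }

     A page that passes this similarity test would be delivered to the user; pages below the threshold, or matching an undesirable-content template, would be blocked at the gateway.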
