Design and Implementation of a Web Text Filtering System (一个WEB文本过滤系统设计与实现)

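English title: Design and Implementation of WEB Text Filtering System
Author: Shen Fengxian (沈凤仙)
Degree level: Master's
Discipline: Computer Application Technology
Chinese keywords: 网页过滤 ; 在线过滤 ; 离线过滤 ; 自适应信息过滤 ; 语义倾向
English keywords: Web Page Filtering ; Online Filtering ; Offline Filtering ; Adaptive Information Filtering ; Semantic Orientation
Degree year: 2009
Advisor: Zhu Qiaoming (朱巧明)
Discipline code: 081203
Degree-granting institution: Soochow University (苏州大学)
Thesis submission date: 2009-05-01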

Abstract

With the rapid development of the Internet, the amount of information online has grown explosively. Text information filtering has advanced considerably, and filtering of Web text has become an active research topic. This thesis builds on an earlier project, the IPCG control gateway. To improve this billing gateway's ability to supervise public information network services, it studies real-time content filtering under Linux and related text filtering techniques, and designs and implements a Web text filtering system based on the IPCG control gateway.
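     The thesis first presents the overall framework and design goals of the system and proposes a distributed implementation. A central early-warning module manages the system as a whole, and real-time online filtering is combined with offline filtering. Synchronization of the distributed database borrows the database synchronization algorithm of the OSPF routing protocol, so that filtering information is shared consistently across the whole network.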
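The record does not describe the synchronization procedure in detail. As a minimal sketch of the general OSPF idea (exchange entry headers first, then request only the entries that are newer on the peer), the following may help. The `Record` layout, the `FilterNode` class, and the key format are hypothetical and are not taken from the thesis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """One filtering entry (hypothetical layout): a payload plus a version counter."""
    seq: int
    payload: str


class FilterNode:
    """A gateway node holding a local copy of the shared filtering database."""

    def __init__(self, name):
        self.name = name
        self.db = {}  # key -> Record

    def publish(self, key, payload):
        """Create or update a local entry, bumping its sequence number."""
        old = self.db.get(key)
        self.db[key] = Record(seq=(old.seq + 1 if old else 1), payload=payload)

    def summary(self):
        """Headers only (key -> seq), loosely analogous to an OSPF database description."""
        return {key: rec.seq for key, rec in self.db.items()}

    def wanted(self, peer_summary):
        """Keys for which the peer advertises a newer copy than ours."""
        return [k for k, seq in peer_summary.items()
                if k not in self.db or self.db[k].seq < seq]

    def pull_from(self, peer):
        """Request and install only the entries that are newer on the peer."""
        for key in self.wanted(peer.summary()):
            self.db[key] = peer.db[key]


def synchronize(a, b):
    """Two-way exchange; afterwards both nodes hold the newest copy of every entry."""
    a.pull_from(b)
    b.pull_from(a)
```

This sketch assumes that a higher sequence number always wins. It does not handle two nodes that publish different payloads under the same sequence number, which a real design would need to resolve.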
     The real-time online filtering module consists of two sub-processes: packet preprocessing, and filtering based on IP addresses and keywords. Packet preprocessing analyzes the data and structure of Web pages in order to extract the correct page content. The IP-based and keyword-based filtering uses a hash-tree structure to organize the IP blacklist and a cache-and-splice strategy to store the content being filtered; keyword filtering combines keyword matches with statistical information to reach an overall decision.
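One plausible reading of the hash-tree blacklist is a tree of hash tables with one level per IPv4 octet, paired with a keyword score normalized by page length. The sketch below illustrates that reading only. The class and function names, the keyword weights, and both thresholds are assumptions, and the three-way result mirrors the positive and uncertain classes that the offline module is said to re-examine.

```python
class IPHashTree:
    """IPv4 blacklist stored as a four-level tree of hash tables, one level per octet."""

    _END = "end"  # marker key: a blacklisted address terminates at this node

    def __init__(self):
        self.root = {}

    @staticmethod
    def _octets(ip):
        parts = [int(p) for p in ip.split(".")]
        if len(parts) != 4 or any(not 0 <= p <= 255 for p in parts):
            raise ValueError(f"not an IPv4 address: {ip!r}")
        return parts

    def add(self, ip):
        node = self.root
        for octet in self._octets(ip):
            node = node.setdefault(octet, {})
        node[self._END] = True

    def __contains__(self, ip):
        node = self.root
        for octet in self._octets(ip):
            node = node.get(octet)
            if node is None:
                return False
        return self._END in node


def keyword_score(text, weights):
    """Weighted keyword hits per 100 characters of page text."""
    if not text:
        return 0.0
    hits = sum(w * text.count(kw) for kw, w in weights.items())
    return 100.0 * hits / len(text)


def classify(src_ip, text, blacklist, weights, block_at=1.0, review_at=0.3):
    """Return 'block', 'uncertain' (left for offline analysis), or 'pass'."""
    if src_ip in blacklist:
        return "block"
    score = keyword_score(text, weights)
    if score >= block_at:
        return "block"
    if score >= review_at:
        return "uncertain"
    return "pass"
```

A lookup costs at most four hash probes regardless of blacklist size, which is the property that makes this kind of structure attractive on the packet path.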
     The offline filtering module further analyzes pages classified as positive examples or as uncertain, and then updates the IP blacklist and the keyword list used by the online filtering module. Offline filtering uses an improved feature-word extraction algorithm and an improved filtering strategy. The extraction algorithm takes into account feature-word length, the structural features of pages, and the semantic orientation of words. The filtering strategy uses SVM in the initial stage and an improved adaptive template-based method in the middle and later stages. The template (profile) is updated with an improved coefficient adjustment strategy, and a feature attenuation factor is introduced to improve filtering accuracy.
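The abstract does not give the thesis's update formula. As an illustration only, a generic Rocchio-style profile update with a decay factor could look like the sketch below. The coefficients `alpha` and `beta`, the `decay` value, and the pruning cutoff are assumed values, not the author's.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def update_profile(profile, doc, relevant, alpha=0.5, beta=0.25, decay=0.9):
    """Return a new profile: old weights decayed, then moved toward or away from doc."""
    # Attenuate every existing feature so stale terms gradually fade out.
    updated = {t: w * decay for t, w in profile.items()}
    # Positive feedback pulls the profile toward the page; negative pushes it away.
    step = alpha if relevant else -beta
    for term, weight in doc.items():
        updated[term] = updated.get(term, 0.0) + step * weight
    # Drop terms whose weight has decayed to (or been pushed below) zero.
    return {t: w for t, w in updated.items() if w > 1e-6}
```

In use, a page vector would be compared with the profile through `cosine`, flagged when the similarity exceeds some threshold, and then fed back through `update_profile` with the resulting judgment.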
     Experimental results show that the proposed method keeps content-filtering analysis and datagram forwarding independent of each other, while improving both the speed and the accuracy of online filtering.
