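Design of a Chinese Web Page Filtering System Based on String Matching and Text Categorization

Original title: 基于串匹配和文本分类的中文网页过滤系统设计
Author: Zhang Shen (张慎)
Degree level: Master's
Discipline: Computer Architecture
Keywords: filtering; string matching; text categorization
Degree year: 2009
Advisor: Zhang Aihua (张爱华)
Discipline code: 081201
Degree-granting institution: Huazhong University of Science and Technology
Submission date: 2009-05-01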

Abstract

With the rapid development and growing popularity of the Internet in recent years, the network has become one of the main sources of information. However, the quality of online information is uneven, and the spread of harmful content does great damage to the physical and mental health of users, minors in particular. Blocking and filtering harmful content on the Internet is therefore essential to protecting young people. Because most information on the Web exists as text, research on web page text filtering that is both highly accurate and real-time is increasingly important.
     The system combines three filtering methods: URL (Uniform Resource Locator) filtering, string-matching filtering, and text-categorization filtering. First, a URL blacklist is maintained, and pages on the blacklist are blocked directly. Second, a fast string-matching technique searches the article title, hyperlink text, and first few paragraphs for sensitive words, providing an initial filter. Third, text categorization further determines the nature of the text and filters out harmful pages. The URLs of pages detected as harmful are fed back into the blacklist, which speeds up the handling of subsequent pages.
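
The three stages and the blacklist feedback described above can be sketched in Python as follows. This is only an illustrative sketch, not the thesis implementation: the names `Page`, `FilterPipeline`, `host_key`, and `lead_paragraphs` are hypothetical, the classifier is passed in as an arbitrary callable, and a plain substring scan stands in for the fast string-matching algorithm. It also assumes that a keyword hit only marks a page as suspicious and that only suspicious pages are sent to the classifier; the abstract does not specify how the two stages are chained.

```python
from dataclasses import dataclass
from typing import Callable, List, Set
from urllib.parse import urlsplit


@dataclass
class Page:
    url: str
    title: str
    link_texts: List[str]
    paragraphs: List[str]


def host_key(url: str) -> str:
    """Blacklist key: the lowercase host name (assumption: one entry covers a whole site)."""
    return (urlsplit(url).hostname or url).lower()


@dataclass
class FilterPipeline:
    blacklist: Set[str]
    sensitive_words: Set[str]
    classify_is_bad: Callable[[str], bool]  # stand-in for the text classifier
    lead_paragraphs: int = 3                # hypothetical size of "the first few paragraphs"

    def should_block(self, page: Page) -> bool:
        key = host_key(page.url)

        # Stage 1: pages whose host is already blacklisted are blocked directly.
        if key in self.blacklist:
            return True

        # Stage 2: scan title, link text and leading paragraphs for sensitive words.
        head = [page.title, *page.link_texts, *page.paragraphs[: self.lead_paragraphs]]
        suspicious = any(word in text for text in head for word in self.sensitive_words)
        if not suspicious:
            return False

        # Stage 3: let the classifier judge the full text of a suspicious page.
        full_text = "\n".join([page.title, *page.paragraphs])
        if self.classify_is_bad(full_text):
            # Feedback: remember the host so later pages stop at stage 1.
            self.blacklist.add(key)
            return True
        return False
```

Running the cheap checks first means the comparatively expensive classifier only sees pages that neither the blacklist nor the keyword scan could settle, which is one plausible reading of how the combination keeps filtering close to real time.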
     Based on an analysis of the architecture of the IE browser, filtering is implemented by combining an ActiveX control with a backend program. The ActiveX control monitors the browser's page visits, passes the browsing information to the backend program, and accepts commands from it to block or redirect browsing events. The backend program performs the content filtering and queries and maintains the database.
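
The division of labor between the two components can be modeled with a small Python sketch. It is a language-neutral illustration under stated assumptions rather than ActiveX code: `BrowserMonitor` stands in for the ActiveX control, `Backend` for the backend program, and the names `BrowseEvent`, `Command`, `Action`, and `warning_url` are hypothetical. The two components are connected by a direct method call here, whereas the abstract does not say how the real components communicate, and an in-memory list stands in for the database.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Optional


class Action(Enum):
    ALLOW = "allow"
    BLOCK = "block"
    REDIRECT = "redirect"


@dataclass(frozen=True)
class BrowseEvent:
    """Browsing information passed from the monitor to the backend."""
    url: str
    html: str


@dataclass(frozen=True)
class Command:
    """Instruction sent back from the backend to the monitor."""
    action: Action
    redirect_url: Optional[str] = None


class Backend:
    """Stand-in for the backend program: filters content and keeps a record."""

    def __init__(self, is_bad: Callable[[BrowseEvent], bool],
                 warning_url: Optional[str] = None):
        self.is_bad = is_bad
        self.warning_url = warning_url
        self.blocked_urls: List[str] = []  # stand-in for the database

    def handle(self, event: BrowseEvent) -> Command:
        if not self.is_bad(event):
            return Command(Action.ALLOW)
        self.blocked_urls.append(event.url)
        if self.warning_url is not None:
            return Command(Action.REDIRECT, self.warning_url)
        return Command(Action.BLOCK)


class BrowserMonitor:
    """Stand-in for the ActiveX control: reports visits and applies commands."""

    def __init__(self, backend: Backend):
        self.backend = backend

    def on_navigate(self, url: str, html: str) -> Optional[str]:
        """Return the URL to display, or None when the navigation is cancelled."""
        command = self.backend.handle(BrowseEvent(url, html))
        if command.action is Action.ALLOW:
            return url
        if command.action is Action.REDIRECT:
            return command.redirect_url
        return None
```

Keeping the monitor limited to reporting events and applying commands leaves all filtering logic and storage in the backend, which matches the split of responsibilities stated in the abstract.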
     Finally, a prototype web page filtering system based on the IE browser was designed and implemented. Experiments on a self-built word list and text corpus show that its overall recognition rate and processing speed basically meet the requirements of harmful-information filtering.
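
The two quantities named in the evaluation, recognition rate and processing speed, could be measured along the following lines. This is a hedged sketch only: the function name `evaluate` and its interface are hypothetical, recognition rate is assumed to mean the fraction of labeled samples classified correctly, and nothing here reproduces the thesis's test data or results.

```python
import time
from typing import Callable, Iterable, Tuple


def evaluate(predict: Callable[[str], bool],
             samples: Iterable[Tuple[str, bool]]) -> Tuple[float, float]:
    """Return (recognition rate, mean seconds per sample) for a filter.

    `samples` is a sequence of (text, is_bad) pairs from a labeled corpus.
    """
    labeled = list(samples)
    if not labeled:
        raise ValueError("at least one labeled sample is required")

    correct = 0
    start = time.perf_counter()
    for text, is_bad in labeled:
        if predict(text) == is_bad:
            correct += 1
    elapsed = time.perf_counter() - start

    return correct / len(labeled), elapsed / len(labeled)
```

Reporting both numbers together reflects the abstract's point that a filter has to be accurate and fast enough for interactive browsing at the same time.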

