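Research on Technologies for Monitoring Harmful Information on the Internet Based on Information Gathering and Content Analysis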

English title: Research of Technologies for Detecting Bad Information in Internet Based on Information Gathering and Content Analysis
Author: Huang Xu (黄旭)
Degree level: Master's
Discipline: Computer Application Technology
Keywords: Information Security; Content Security; Search Strategy; Repeats (repeated strings); Bayesian Theory; Feedback Mechanism
Degree year: 2008
Advisor: Zhu Yanqin (朱艳琴)
Discipline code: 081203
Degree-granting institution: Soochow University (苏州大学)
Thesis submission date: 2008-04-01

Abstract

With its unprecedented capacity for disseminating information, the Internet has brought great convenience to people's lives, but it has also become a carrier of harmful content such as reactionary, pornographic, and violent material. The spread of such content over the Internet, especially sensitive information bearing on national security, has become a serious social problem. Identifying this content quickly and effectively within massive volumes of data, so as to stop its illegal dissemination and safeguard the security of online content, has become an important research topic in the field of content security.

Most existing research focuses on information filtering and automatic blocking at the gateway or on the client, whereas national security agencies still check suspicious sites largely by hand, which is inefficient. To address this problem, this thesis takes information gathering and content analysis as its basic approach and studies the automatic discovery and processing of harmful information. It examines the relevant principles and technologies of Internet architecture, natural language processing, artificial intelligence, and machine learning; the specific work covers web page collection, analysis of the formal features of keywords, text feature extraction, and text classification. First, starting from the structure of the Web, the thesis studies a content-based method for computing link weights and proposes a crawler search strategy based on content evaluation. Second, drawing on the inherent features of harmful information, it analyzes the formal characteristics of such information and, because that information tends to be concealed and changeable, studies a feature extraction method based on repeated strings (repeats). Third, building on Bayesian theory, it proposes a design for a real-time text classifier, together with a document feature feedback mechanism that improves classification performance. Finally, taking the real network environment into account, it proposes an implementation framework for a platform that monitors harmful information on the Internet.

Given the rapid growth of Internet applications, this work is of some positive significance for improving the efficiency of the relevant departments, cleaning up the online environment, and promoting the building of a harmonious society, and it represents a useful exploration of content security in networked environments. The results also promote the coordinated application of networking, natural language processing, artificial intelligence, and related technologies in information security.
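The abstract names repeat-based feature extraction and a Bayesian classifier with feedback but gives no algorithmic detail, so the following is only a minimal Python sketch of how those two ideas might fit together, not the method of the thesis. Everything in it is an assumption made for illustration: the names `repeated_substrings` and `FeedbackNaiveBayes`, the brute-force counting of character substrings (the thesis may well use a suffix-tree or PAT-tree structure instead), the length and count thresholds, and the choice of a multinomial naive Bayes model with Laplace smoothing.

```python
import math
from collections import Counter, defaultdict


def repeated_substrings(text, min_len=2, max_len=4, min_count=2):
    """Return substrings of length min_len..max_len that occur at least
    min_count times in text (brute-force counting, illustrative only)."""
    counts = Counter(
        text[i:i + n]
        for n in range(min_len, max_len + 1)
        for i in range(len(text) - n + 1)
    )
    return {s for s, c in counts.items() if c >= min_count}


class FeedbackNaiveBayes:
    """Hypothetical multinomial naive Bayes classifier with Laplace
    smoothing that can be updated one document at a time."""

    def __init__(self):
        self.doc_counts = Counter()                 # label -> number of documents
        self.feature_counts = defaultdict(Counter)  # label -> feature -> count
        self.vocabulary = set()

    def update(self, features, label):
        """Add one labelled document. The same call serves as the feedback
        step when a reviewed result is fed back into the model."""
        self.doc_counts[label] += 1
        self.feature_counts[label].update(features)
        self.vocabulary.update(features)

    def predict(self, features):
        """Return the label with the highest log posterior, or None if untrained."""
        total_docs = sum(self.doc_counts.values())
        # The extra 1 reserves probability mass for features never seen in training.
        vocab_size = len(self.vocabulary) + 1
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            score = math.log(n_docs / total_docs)
            total = sum(self.feature_counts[label].values())
            for f in features:
                count = self.feature_counts[label][f]
                score += math.log((count + 1) / (total + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

In this sketch a document would first be reduced to its set of repeated substrings and then passed to `update` with a known label or to `predict` for classification; feeding a manually reviewed page back through `update` is one simple way to realize a feedback loop, under the assumptions stated above.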
