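Research on Filtering Methods for Internet Information Content Security (互联网信息内容安全过滤方法研究)

English title: Study of the Information Content Security Filter Method in WEB
Author: Li Dongyan (李东艳)
Degree level: Master's
Discipline: Computer Software and Theory
Chinese keywords (translated): content filtering ; information security ; illegitimate text ; logical rule extraction ; learning from examples
English keywords: Information Security ; Content Filter ; Illegitimate Text ; Produce of Logical Rule ; Learning from Examples
Degree year: 2004
Supervisor: Zhang Yongkui (张永奎)
Discipline code: 081202
Degree-granting institution: Shanxi University (山西大学)
Thesis submission date: 2004-06-01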

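Abstract

Internet information content security filtering (Information Content Security Filter, ICSF) is the task of identifying illegitimate texts, that is, texts containing harmful content, among the massive volume of Web text so that they can be blocked. It has become a new research area within information filtering.
     This thesis studies several key techniques in content security filtering, including text representation, algorithms for identifying illegitimate text, and dynamic learning from new texts. It also presents an experimental ICSF system that trains on illegitimate texts, extracts and updates rules, and classifies new documents.
     The work and contributions of this thesis are as follows:
     1. The characteristics of illegitimate texts are analyzed systematically; the features of their content and wording are summarized and given a formal representation.
     2. Content filtering is implemented with a rule-based algorithm. Using learning from examples over a large set of training instances, an improved version of the OCAT algorithm for logical rule extraction is applied to extract text classification rules. It produces recognition rules for the positive example set and for the negative example set, which together perform binary classification of texts. In addition, by analyzing the word forms peculiar to illegitimate texts, discrimination rules are given that compute the credibility that a text exhibits these wording features. Finally, the rules extracted from the training set are combined with the special-word rules to judge new documents.
     3. Different update algorithms are applied to the different kinds of rules so that newly emerging illegitimate documents are recognized automatically. The logical rules are revised according to feedback from misjudged documents, which steadily improves recognition of new illegitimate documents and achieves incremental rule learning. An automatic special-word recognition algorithm is also proposed: it identifies special words in new illegitimate texts and adds them to the special-word list on which the special-word rules are based, thereby updating those rules.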
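     The rule extraction described in item 2 can be illustrated with a minimal Python sketch of the one-clause-at-a-time idea behind OCAT: each clause (a disjunction of literals) must accept every positive example while rejecting some of the remaining negative examples, and clauses are added until all negatives are rejected. This is not the improved algorithm of the thesis, whose details the abstract does not give. Several things here are assumptions made for illustration only: documents are modeled as sets of terms, a literal is a pair (term, present), each clause is seeded with one negative example so that progress is guaranteed, and all function names and the toy data are hypothetical.

```python
def holds(literal, doc):
    """A literal (term, present) holds when the term's presence in doc matches `present`."""
    term, present = literal
    return (term in doc) == present


def accepts(cnf, doc):
    """A CNF rule accepts doc when every clause has at least one literal that holds."""
    return all(any(holds(lit, doc) for lit in clause) for clause in cnf)


def build_clause(positives, seed, rejected_pool, literals):
    """Greedily build one disjunction that holds on every positive and fails on `seed`."""
    # Only literals that are false on the seed negative may enter the clause,
    # so the finished clause is guaranteed to reject the seed.
    candidates = [lit for lit in literals if not holds(lit, seed)]
    clause, uncovered = [], list(positives)
    still_rejected = list(rejected_pool)
    while uncovered:
        def score(lit):
            gained = sum(holds(lit, d) for d in uncovered)      # positives newly accepted
            lost = sum(holds(lit, d) for d in still_rejected)   # negatives no longer rejected
            return gained / (lost + 1)
        best = max(candidates, key=score, default=None)
        if best is None or score(best) == 0:
            raise ValueError("a positive example is identical to a negative one")
        clause.append(best)
        uncovered = [d for d in uncovered if not holds(best, d)]
        still_rejected = [d for d in still_rejected if not holds(best, d)]
    return clause


def learn_cnf(positives, negatives, vocabulary):
    """Add one clause at a time until every negative example is rejected."""
    literals = [(t, p) for t in sorted(vocabulary) for p in (True, False)]
    cnf, remaining = [], list(negatives)
    while remaining:
        clause = build_clause(positives, remaining[0], remaining, literals)
        cnf.append(clause)
        # Keep only the negatives that this clause still accepts.
        remaining = [d for d in remaining if any(holds(l, d) for l in clause)]
    return cnf


# Toy training data (invented for this sketch).
illegal = [{"gamble", "bet"}, {"gamble", "casino"}, {"bet", "odds"}]
legal = [{"news", "sports"}, {"sports", "odds"}, {"news", "casino"}]
vocab = set().union(*illegal, *legal)

# One rule per class, mirroring the positive-set and negative-set rules in item 2.
illegal_rule = learn_cnf(illegal, legal, vocab)
legal_rule = learn_cnf(legal, illegal, vocab)
```

     Learning one rule per class by swapping the roles of the two example sets is one plausible reading of "rules for the positive example set and for the negative example set"; a new document accepted by neither rule, or by both, would then need a tie-breaking policy that this sketch leaves open.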
Internet information content security filtering means identifying illegitimate texts, those that contain harmful content, and removing them. As illegitimate text on the Web increases, content security filtering has become a new research area in information filtering.
    This thesis studies several key problems in content security filtering, such as the representation of training texts, the identification of illegitimate text, and automatic learning from new texts. We also design an experimental ICSF system that implements the functions mentioned above.
    The main work and contributions of this thesis are:
    1. The characteristics of illegitimate text are analyzed thoroughly; we summarize the content and vocabulary features of illegitimate texts and propose a formal representation of them.
    2. We implement content security filtering with a rule-based approach. Given a large number of training examples, we adopt a learning-from-examples approach that produces rules with an extended OCAT algorithm in order to classify texts. We also propose special-word rules that calculate the credibility of a text. Finally, we combine the rules learned from training with the special-word rules to classify new documents.
    3. Two automatic learning algorithms are used to improve the two kinds of rules. First, we modify the logical rules according to feedback information, which improves the ability to identify new illegitimate content and implements incremental learning. Second, we present an algorithm that automatically extracts new special words from new illegitimate documents. The system can thus keep up with new illegitimate information.
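    The two update mechanisms in item 3 can be sketched in Python under stated assumptions. The abstract does not specify either algorithm, so everything below is illustrative: a rule is assumed to be a conjunction of clauses over (term, present) literals, a false alarm is repaired by appending one clause that rejects the misjudged document while still accepting all stored positive examples, the credibility of a text is taken to be the fraction of its tokens found in the special-word list, and a new special word is any token that occurs in at least `min_docs` new illegitimate documents and in no legitimate one. All names, thresholds, and toy tokens are hypothetical.

```python
from collections import Counter


def holds(literal, doc):
    """A literal (term, present) holds when the term's presence in doc matches `present`."""
    term, present = literal
    return (term in doc) == present


def accepts(cnf, doc):
    """A CNF rule accepts doc when every clause has at least one literal that holds."""
    return all(any(holds(lit, doc) for lit in clause) for clause in cnf)


def repair_false_alarm(cnf, doc, positives, vocabulary):
    """Return cnf plus one clause that rejects `doc` but accepts every stored positive."""
    clause = []
    for pos in positives:
        if any(holds(lit, pos) for lit in clause):
            continue  # this positive is already accepted by the new clause
        differing = sorted(t for t in vocabulary if (t in pos) != (t in doc))
        if not differing:
            raise ValueError("misjudged document is identical to a positive example")
        term = differing[0]
        # The literal is true on `pos` and, because the two differ on `term`, false on `doc`.
        clause.append((term, term in pos))
    return cnf + [clause]


def special_word_credibility(tokens, special_words):
    """Fraction of tokens that appear in the special-word list (assumed scoring)."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in special_words) / len(tokens)


def extend_special_words(special_words, new_illegal_docs, legal_docs, min_docs=2):
    """Add tokens seen in at least `min_docs` new illegitimate docs and in no legitimate doc."""
    legal_vocab = set().union(*legal_docs)
    doc_freq = Counter(t for doc in new_illegal_docs for t in set(doc))
    found = {t for t, n in doc_freq.items() if n >= min_docs and t not in legal_vocab}
    return set(special_words) | found


# Toy data (invented for this sketch).
rule = [[("gamble", True), ("bet", True)]]          # accept if "gamble" or "bet" occurs
positives = [{"gamble", "casino"}, {"bet", "odds"}]
false_alarm = {"bet", "news"}                       # legitimate text wrongly accepted
vocabulary = {"gamble", "casino", "bet", "odds", "news"}
repaired = repair_false_alarm(rule, false_alarm, positives, vocabulary)

updated_words = extend_special_words(
    {"c4sino"},
    new_illegal_docs=[["c4sino", "b3t", "win"], ["b3t", "win", "now"]],
    legal_docs=[["win", "news"]],
)
score = special_word_credibility(["b3t", "now", "c4sino", "free"], updated_words)
```

    Appending a clause only ever makes the rule stricter, so documents already rejected stay rejected; handling the opposite error, an illegitimate document that the rule misses, requires loosening existing clauses and is left out of this sketch.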

