Research and Implementation of Semantic Filtering for the XML Engine Security Gateway
Abstract
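Amid the vast and varied information on the Internet, harmful content reaches people in many forms and through many channels, and affects them in many ways. Effective filtering of such content is therefore essential to building a healthy and safe Internet environment. Traditional text filtering algorithms, however, judge text only at the level of structural correspondence and cannot capture its semantics, so they struggle to meet today's demand for intelligent information processing.
     Drawing on computational linguistics, this thesis proposes and implements a filtering method based on semantic analysis. Long texts that slip past keyword-matching filters can be identified and handled well through semantic analysis, which effectively prevents the spread of large volumes of harmful and spam information.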
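     The contributions of this thesis are as follows. 1. To address the problems that arise in various automatic word segmentation methods, the concept of an intelligent dictionary with a self-learning mechanism is refined, and a basic model of the intelligent dictionary is implemented. While segmenting text, the model learns new words on its own without manual intervention, which gives the system its intelligence. The segmentation algorithm combines forward and backward maximum matching, which substantially improves segmentation accuracy; together with a word frequency library it also resolves segmentation ambiguities effectively, further safeguarding accuracy. 2. Following an in-depth study of feature extraction algorithms, a TFIDF-based feature extraction algorithm is proposed that builds on the stability of TFIDF and introduces a part-of-speech coefficient to improve feature set selection. Using latent semantic tagging, features are multiplied by coefficients that differ by part of speech, highlighting how well features of each part of speech represent the document category. This lightens the workload of the text classifier and further improves processing speed and quality. 3. After studying several major classifier algorithms, and given the good performance and low complexity of the Bayesian algorithm as well as the project's practical conditions (large batches, high speed requirements, and few categories), a classifier model based on the naive Bayes algorithm is proposed. It uses the part-of-speech coefficients of the features and statistical methods to train on and classify the texts. Experiments show that the classifier achieves high recall and precision, which effectively guarantees the filtering quality of the whole semantic filtering module.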
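The segmentation step in contribution 1 can be illustrated with a minimal sketch of bidirectional maximum matching. Everything named here is an assumption made for illustration: the functions `forward_max_match`, `backward_max_match`, and `segment`, the toy frequency table, and the tie-breaking rule (fewer words first, then higher summed frequency) are not taken from the thesis, and the self-learning of new words is not modelled.

```python
def forward_max_match(text, dictionary, max_len):
    """Scan left to right, taking the longest dictionary word at each position."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted so the scan cannot stall.
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


def backward_max_match(text, dictionary, max_len):
    """Scan right to left, taking the longest dictionary word ending at each position."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                words.append(piece)
                j -= size
                break
    return words[::-1]


def segment(text, freq):
    """Segment text; freq is a hypothetical word -> count table (the frequency library)."""
    max_len = max(map(len, freq), default=1)
    fwd = forward_max_match(text, freq, max_len)
    bwd = backward_max_match(text, freq, max_len)
    if fwd == bwd:
        return fwd

    # Assumed heuristic: prefer fewer words, then the higher total frequency.
    def score(words):
        return (-len(words), sum(freq.get(w, 0) for w in words))

    return max((fwd, bwd), key=score)


# Toy frequency table with made-up counts.
FREQ = {"研究": 50, "研究生": 10, "生命": 40, "命": 5, "的": 100, "起源": 20}
result = segment("研究生命的起源", FREQ)
```

In this toy case the forward pass greedily takes 研究生 and leaves 命 stranded, while the backward pass finds 研究 / 生命; both yield four words, so the frequency table decides between them.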
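     The results of this research have been applied to the XML Engine security gateway, a project of the National Support Program and a Guangdong Province science and technology project. Adding the semantic filtering module to XML Engine enables intelligent filtering of large amounts of harmful information and further strengthens the security of XML Engine as a whole.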
Among the large and varied body of information on the Internet, harmful content affects people in many different ways and through many channels. Necessary and effective filtering of such content is therefore an important part of building a healthy and safe network environment. However, traditional text filtering methods can only judge text at the level of structural correspondence, not by its semantics, which makes it hard for them to meet the demand for intelligent information processing.
     By drawing on computational linguistics, this thesis proposes and implements a filtering method based on semantic analysis. Long texts that cannot be filtered out by keyword matching can be better identified and processed through semantic analysis, which effectively prevents large amounts of harmful spam from spreading.
     The contributions of this thesis are as follows. First, to address the problems of existing word segmentation methods, the concept of an intelligent dictionary with a self-learning mechanism is refined, and a basic model of the intelligent dictionary is implemented. The model learns new words without human intervention while segmenting, which gives the system its intelligence. The segmentation algorithm combines forward and backward maximum matching, which improves segmentation accuracy. In addition, with the help of a word frequency library, the algorithm can resolve segmentation ambiguities, which further safeguards accuracy. Second, after an in-depth study of feature extraction algorithms, a TFIDF-based feature extraction algorithm is proposed that builds on the stability of TFIDF and introduces a part-of-speech coefficient to improve feature set selection. The algorithm uses latent semantic tagging and multiplies features by coefficients that differ by part of speech. This highlights how well features of each part of speech represent the document category, which relieves the workload of the text classifier and improves the speed and quality of processing. Third, after studying several major classifier algorithms, and given that the Bayesian algorithm offers good performance at low complexity while the project involves large batches, high speed requirements, and few categories, a classifier model based on the naive Bayes algorithm is proposed. It uses the part-of-speech coefficients of the features and statistical methods to train on and classify the texts. Experiments show that the classifier achieves high recall and precision, which effectively guarantees the filtering quality of the whole semantic filtering module.
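The second contribution, TFIDF scores scaled by a part-of-speech coefficient, can be sketched as below. This is only an illustration under stated assumptions: the abstract gives no coefficient values, so `POS_WEIGHT` and `DEFAULT_WEIGHT` are invented, the tag names (`n`, `v`, `a`) are hypothetical, and the plain `tf * log(N / df)` form of TFIDF is assumed rather than confirmed by the source.

```python
import math
from collections import Counter

# Hypothetical part-of-speech coefficients; the abstract does not state values.
POS_WEIGHT = {"n": 1.5, "v": 1.2, "a": 1.0}
DEFAULT_WEIGHT = 0.5  # assumed weight for all other parts of speech


def weighted_tfidf(docs):
    """Score each word per document; docs is a list of lists of (word, pos) pairs."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each word.
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update({word for word, _ in doc})

    results = []
    for doc in docs:
        counts = Counter(word for word, _ in doc)
        pos_of = dict(doc)  # if a word has several tags, the last one wins
        total = len(doc)
        scores = {}
        for word, count in counts.items():
            tf = count / total
            idf = math.log(n_docs / doc_freq[word])
            weight = POS_WEIGHT.get(pos_of[word], DEFAULT_WEIGHT)
            scores[word] = tf * idf * weight
        results.append(scores)
    return results


# Toy corpus with made-up words and tags.
corpus = [
    [("lottery", "n"), ("win", "v"), ("now", "d")],
    [("meeting", "n"), ("now", "d")],
]
scores = weighted_tfidf(corpus)
```

With these assumed coefficients a noun outranks a verb of equal TFIDF, and a word that appears in every document still scores zero because its IDF is zero.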
     The results of this research have already been applied to the XML Engine security gateway, a Guangdong Province science and technology project with national support. Adding the semantic filtering module to XML Engine enables intelligent filtering of large amounts of harmful information and further assures the security of XML Engine.
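The naive Bayes classifier named in the abstract could take roughly the following shape. This is a sketch under assumptions, not the thesis implementation: the class `NaiveBayesFilter`, Laplace smoothing with `alpha`, the labels `spam` and `ham`, and the use of weighted feature values (for example, part-of-speech-scaled scores) in place of raw counts are all choices made here for illustration.

```python
import math
from collections import defaultdict


class NaiveBayesFilter:
    """Multinomial-style naive Bayes over weighted features (illustrative only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # assumed Laplace smoothing constant
        self.doc_counts = defaultdict(int)
        self.word_weights = defaultdict(lambda: defaultdict(float))
        self.total_weight = defaultdict(float)
        self.vocab = set()

    def train(self, features, label):
        """Add one training text; features maps word -> weight."""
        self.doc_counts[label] += 1
        for word, weight in features.items():
            self.word_weights[label][word] += weight
            self.total_weight[label] += weight
            self.vocab.add(word)

    def log_score(self, features, label):
        """Log prior plus weighted log likelihoods for one label."""
        n_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[label] / n_docs)
        denom = self.total_weight[label] + self.alpha * len(self.vocab)
        for word, weight in features.items():
            if word not in self.vocab:
                continue  # words never seen in training are ignored
            seen = self.word_weights[label].get(word, 0.0)
            score += weight * math.log((seen + self.alpha) / denom)
        return score

    def classify(self, features):
        """Return the label with the highest log score."""
        return max(self.doc_counts, key=lambda label: self.log_score(features, label))


# Toy training data with made-up weights.
nb = NaiveBayesFilter()
nb.train({"lottery": 1.5, "win": 1.2}, "spam")
nb.train({"win": 1.2, "prize": 1.5}, "spam")
nb.train({"meeting": 1.5, "report": 1.5}, "ham")
nb.train({"report": 1.5, "schedule": 1.5}, "ham")
```

Calling `nb.classify({"win": 1.2, "lottery": 1.5})` on this toy data selects the `spam` label, since both words carry weight only in the spam training texts.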
