Research and Implementation of an E-mail Filtering System Based on an Information Fusion Criterion
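Abstract
Content-based spam filtering is a key problem in Internet security research. Applying machine learning methods to spam classification is an effective way to handle large volumes of spam. Addressing the characteristics of e-mail, this thesis analyzes the shortcomings of traditional filtering techniques and, on the basis of a statistical analysis of a large number of spam messages, studies e-mail filtering based on an information fusion criterion. The thesis covers the following main points:
     1. It surveys the state of research on spam filtering, including the definition of spam, the harm it causes, and the main filtering techniques in current use. After summarizing and comparing common feature extraction methods and filtering algorithms, it proposes a Term Frequency Cross Entropy (TFCE) algorithm for classification, which replaces the IDF function in the Term Frequency Inverse Document Frequency (TFIDF) algorithm with expected cross entropy (CE).
     2. Building on an understanding of information fusion technology and on theoretical analysis, it addresses the weakness of traditional spam decisions, which rely on a single criterion, and focuses on a fused spam decision criterion based on triangular norm operators. It then describes in detail the principle of the criterion, its evaluation results, and its implementation, including the system architecture, the functional and organizational models, the mail filtering workflow, and the spam feedback module.
     3. It verifies the effectiveness of the algorithms experimentally. The simulation experiments have two parts. The first compares the strengths, weaknesses, and feature extraction accuracy of the evaluation-function-based feature extraction methods used in mail filtering systems: document frequency (DF), mutual information (MI), information gain (IG), expected cross entropy (CE), TFIDF, and the newly proposed TFCE. The second compares the information fusion decision criterion based on triangular norm operators with single-criterion decision methods based on term frequency or document frequency.
     Finally, the thesis offers suggestions for further refining and improving the mail filtering system based on the TFCE algorithm and the information fusion criterion, so as to reach the best decision and effectively reduce the probability of missed and false judgments, providing a new avenue of exploration for the development of mail filtering technology.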
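The TFCE idea in item 1 can be illustrated with a minimal sketch that weights each term by its term frequency multiplied by expected cross entropy in place of IDF. The abstract does not give the thesis's exact formula, so the sketch assumes the commonly cited definition CE(t) = P(t) Σ_c P(c|t) log(P(c|t)/P(c)), estimated from document counts. The function names, the toy corpus, and the length-normalized term frequency are all assumptions made for illustration.

```python
import math
from collections import Counter


def expected_cross_entropy(term, docs, labels):
    """Estimate CE(t) = P(t) * sum_c P(c|t) * log(P(c|t) / P(c)) from document counts.

    docs is a list of token lists; labels holds one class label per document.
    """
    n_docs = len(docs)
    # Labels of the documents that contain the term at least once.
    labels_with_term = [lab for doc, lab in zip(docs, labels) if term in doc]
    if not labels_with_term:
        return 0.0
    p_t = len(labels_with_term) / n_docs
    class_counts = Counter(labels)
    class_counts_with_term = Counter(labels_with_term)
    total = 0.0
    for c, n_c in class_counts.items():
        p_c = n_c / n_docs
        p_c_given_t = class_counts_with_term[c] / len(labels_with_term)
        if p_c_given_t > 0:  # treat 0 * log(0) as 0
            total += p_c_given_t * math.log(p_c_given_t / p_c)
    return p_t * total


def tfce_weights(message, docs, labels):
    """Weight each term of a tokenized message by TF * CE (CE stands in for IDF)."""
    counts = Counter(message)
    length = len(message)
    return {
        term: (count / length) * expected_cross_entropy(term, docs, labels)
        for term, count in counts.items()
    }


# Hypothetical toy corpus, for illustration only.
train_docs = [
    ["free", "money", "now"],
    ["free", "offer"],
    ["meeting", "now"],
    ["project", "meeting"],
]
train_labels = ["spam", "spam", "ham", "ham"]

weights = tfce_weights(["free", "free", "now"], train_docs, train_labels)
for term, weight in sorted(weights.items()):
    print(f"{term}: {weight:.4f}")
```

In this toy corpus "free" occurs only in spam and receives a positive weight, while "now" occurs equally often in both classes and receives a weight of zero, which is the kind of class sensitivity that plain IDF does not provide.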
Nowadays e-mail is one of the most common network applications and has become one of the most important means of communication. Content-based spam filtering is an important issue in Internet security technology. Applying machine learning approaches such as text categorization to spam detection is an efficient way of dealing with large volumes of spam.
     Addressing the characteristics of e-mail, this thesis analyzes the inadequacy of traditional spam filtering techniques on the basis of a statistical analysis of a large number of spam messages. We emphasize comparing the advantages, disadvantages, and scope of application of various feature selection methods, and we replace the IDF function of the Term Frequency Inverse Document Frequency (TFIDF) algorithm with expected Cross Entropy (CE), yielding an algorithm named Term Frequency Cross Entropy (TFCE). At the same time, we propose a new decision criterion based on triangular norm fusion to further improve the accuracy of feature selection and to effectively reduce the probability of false and missed judgments.
     This thesis mainly consists of the following parts. First, it summarizes the state of spam filtering, including the definition of spam, the harm it causes, and filtering techniques, and reviews common approaches to feature pruning, anti-spam filters, and mail corpora, with emphasis on feature selection methods, filtering algorithms, and the theory of TFCE. Second, it describes the framework and implementation of the new algorithms, including the architecture, function model, organization model, and flowchart of spam filtering; based on research into and theoretical analysis of information fusion technology, it gives a detailed analysis of the spam fusion decision criterion. Third, it presents simulation results to verify performance: one experiment compares various feature selection methods, including TFCE; the other compares the information fusion criterion based on triangular norms with single decision criteria. The simulation results suggest that the average accuracy of TFCE is higher than that of the traditional feature selection methods, and that the information fusion criterion based on triangular norms also outperforms the single decision criteria.
     Finally, this thesis offers suggestions for further improving the performance of the spam filtering system based on the TFCE feature selection method and the triangular norm fusion algorithm, so as to effectively reduce false and missed judgments, and provides a new avenue of exploration for the development of e-mail filtering technology.
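The fused decision criterion described above can be illustrated with standard triangular norms (t-norms). The abstract does not state which operator the thesis uses, so this sketch shows three textbook t-norms (minimum, product, Łukasiewicz) together with a hypothetical `fuse_and_decide` helper that combines per-criterion spam scores in [0, 1] and thresholds the result. The example scores, the threshold of 0.5, and all function names are assumptions.

```python
from functools import reduce


def t_min(a, b):
    """Minimum (Goedel) t-norm."""
    return min(a, b)


def t_product(a, b):
    """Product t-norm."""
    return a * b


def t_lukasiewicz(a, b):
    """Lukasiewicz t-norm."""
    return max(0.0, a + b - 1.0)


def dual_conorm(t_norm):
    """Build the t-conorm dual to a t-norm: S(a, b) = 1 - T(1 - a, 1 - b)."""
    def s(a, b):
        return 1.0 - t_norm(1.0 - a, 1.0 - b)
    return s


def fuse_and_decide(scores, t_norm=t_product, threshold=0.5):
    """Fuse per-criterion spam scores with a t-norm and threshold the result.

    Hypothetical helper: each score is one criterion's degree of belief,
    in [0, 1], that the message is spam.
    """
    if not scores:
        raise ValueError("at least one score is required")
    # 1.0 is the identity element of every t-norm, so it is a safe start value.
    fused = reduce(t_norm, scores, 1.0)
    return fused, fused >= threshold


# Hypothetical scores from a term-frequency criterion and a document-frequency criterion.
print(fuse_and_decide([0.9, 0.8]))
print(fuse_and_decide([0.9, 0.3]))
print(fuse_and_decide([0.9, 0.3], t_norm=dual_conorm(t_product)))
```

With a t-norm the fused decision is conservative: a message is flagged only when every criterion gives a high score. Passing the dual t-conorm instead flags a message when any single criterion is confident, so the choice of operator determines how missed and false judgments are traded off.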