Research of E-mail Filtering Based on Bayes Technology

Chinese title: 基于贝叶斯技术的邮件过滤研究
Author: Li Wen (李雯)
Degree level: Master's
Discipline: Computer Software and Theory
Keywords: e-mail filtering; Bayes technology; spam; Naïve Bayes; Boosting method
Degree year: 2008
Supervisor: Liu Peiyu (刘培玉)
Discipline code: 081202
Degree-granting institution: Shandong Normal University
Thesis submission date: 2008-04-08

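Abstract

E-mail needs no introduction. With the rapid growth of the Internet, e-mail has been widely adopted because it is convenient, fast, and cheap, and it has become one of the most popular means of exchanging information. Along with that convenience, however, has come spam, which does considerable harm. In recent years, floods of commercial, pornographic, and reactionary spam and of e-mail-borne viruses have troubled and harmed Internet users and had a strongly negative effect on society, so the security of mail systems has drawn close attention from the industry. The problem is severe in China, which is now the world's third-largest source of spam, and countermeasures are urgently needed. Research on spam filtering is therefore of great practical significance.
     Bringing spam under control requires not only legal and administrative measures but also good filtering technology. This thesis concentrates on the technical side and studies spam filtering. The main work is as follows: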
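     1. An analysis and study of e-mail filtering techniques and Bayesian techniques.
     The thesis first analyzes the research background and current state of spam filtering, including the harm spam causes and its characteristic types, and explains why spam keeps proliferating despite repeated attempts to ban it. It then surveys the mainstream anti-spam techniques in use in China and abroad, noting the characteristics and shortcomings of each, and examines the basic principles of Bayesian techniques and the Naïve Bayes algorithm and their application to e-mail filtering.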
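     2. An improvement to Naïve Bayes: the improved Naïve Bayes algorithm.
     The Naïve Bayes algorithm, which is based on probability and statistics, is simple, fast, and accurate, and is widely used in text classification. In e-mail filtering, however, misjudging a legitimate message as spam can cause the user serious loss. The traditional Naïve Bayes algorithm does not fully account for the different characteristics of legitimate mail and spam when classifying and filtering, which limits its usefulness for e-mail filtering. This thesis therefore introduces the idea of loss minimization, combines it with Naïve Bayes, adapts the result to the characteristics of spam, and presents an improved Naïve Bayes spam filtering algorithm. The algorithm lets users obtain the filtering behavior they need by adjusting the value of k.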
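     3. A second improvement to Naïve Bayes that brings Boosting into e-mail filtering: the improved Bayes algorithm based on the Boosting method.
     Although the improved Naïve Bayes algorithm lets the system weight its filtering of incoming mail by choosing k dynamically, a value of k that is too large or too small lowers the precision of filtering.
     The defining strength of the Boosting method is that it can effectively raise the accuracy of an algorithm, turning a low-accuracy "weak learning algorithm" into a high-accuracy "strong learning algorithm". To raise the precision of e-mail filtering, this thesis applies Boosting to the e-mail filtering domain, uses it to boost the Naïve Bayes algorithm, and proposes a new filtering algorithm: the improved Bayes algorithm based on the Boosting method. Experimental results show that the algorithm improves classification accuracy, lowers the misjudgment rate, reduces the information loss and misclassification seen with the traditional method, and improves the overall performance of e-mail filtering.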
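     4. Design and implementation of an e-mail filtering system based on the improved Bayes algorithms.
     The improved Bayes algorithms proposed in this thesis were tested at the application level on an e-mail filtering platform. The experimental data confirm that the algorithms are reliable and effective, and the tests gave satisfactory results in classifying and filtering spam.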

Abstract (English version)

E-mail is familiar to everyone. With the rapid development of the Internet, e-mail has become one of the most popular means of communication because it is convenient, fast, and cheap. But spam (also called "junk mail") has emerged alongside this convenience and harms users. In recent years, the flood of spam of all kinds has become a serious nuisance for individuals and for society. Mail system security has attracted widespread interest and become a research focus in industry. The spam problem is severe in China, which is now the world's third-largest source of spam. The study of spam filtering is therefore of great significance.
     To deal with spam effectively, we need not only legislation and management measures but also good spam filtering technology. This thesis mainly studies spam filtering technology. Its contents are as follows:
     1. An analysis and study of e-mail filtering technology and Bayes technology.
     We analyze the research background and current status of spam filtering, including the harm and characteristics of spam and the reasons spam keeps increasing. We survey the anti-spam technologies now prevalent worldwide and point out their advantages and disadvantages. We then study Bayes technology and the Naïve Bayes algorithm in detail.
     2. An improved filtering algorithm based on Naïve Bayes: the improved Naïve Bayes algorithm.
     Naïve Bayes is widely used in text classification because this simple method classifies texts accurately and quickly. In e-mail filtering, however, mistaking legitimate mail for spam costs more than mistaking spam for legitimate mail. The traditional Naïve Bayes method does not consider the different characteristics of legitimate mail and spam when classifying and filtering mail, and does not take into account the loss caused by misclassifying legitimate mail as spam, so it has limitations for e-mail filtering. This thesis presents an improved spam filtering algorithm that minimizes the user's loss. The improved Naïve Bayes algorithm can meet the user's needs by changing the value of k.
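     The abstract does not define k. One common reading, assumed here, is the cost-sensitive decision rule in which a message is labeled spam only when the posterior odds P(spam | message) / P(legitimate | message) exceed k, so that a larger k protects legitimate mail at the price of letting more spam through. The sketch below illustrates that rule on a multinomial Naïve Bayes model with Laplace smoothing. The function names and the tokenized input format are illustrative and are not taken from the thesis.

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """Fit a multinomial Naive Bayes model with Laplace smoothing.

    docs: list of token lists; labels: list of "spam" / "ham" strings.
    """
    vocab = {tok for doc in docs for tok in doc}
    counts = {"spam": Counter(), "ham": Counter()}
    n_docs = Counter(labels)
    for doc, label in zip(docs, labels):
        counts[label].update(doc)
    model = {"vocab": vocab, "log_prior": {}, "log_like": {}}
    for c in ("spam", "ham"):
        total = sum(counts[c].values())
        model["log_prior"][c] = math.log(n_docs[c] / len(docs))
        model["log_like"][c] = {
            t: math.log((counts[c][t] + 1) / (total + len(vocab)))
            for t in vocab
        }
    return model

def spam_log_odds(model, doc):
    """Return log P(spam | doc) - log P(ham | doc); unknown tokens are ignored."""
    score = model["log_prior"]["spam"] - model["log_prior"]["ham"]
    for t in doc:
        if t in model["vocab"]:
            score += model["log_like"]["spam"][t] - model["log_like"]["ham"][t]
    return score

def classify(model, doc, k=1.0):
    """Label as spam only if the posterior odds exceed k (assumed cost ratio)."""
    return "spam" if spam_log_odds(model, doc) > math.log(k) else "ham"
```

     With k = 1 the rule reduces to ordinary Naïve Bayes. Raising k moves borderline messages to the legitimate side, which is the trade-off the abstract describes.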
     3. A second improved algorithm that combines Naïve Bayes with the Boosting method: the improved Naïve Bayes algorithm based on the Boosting method.
     By choosing different values of k, the improved Naïve Bayes algorithm lets the filtering system favor different types of e-mail. However, a value of k that is too large or too small reduces accuracy.
     The main strength of the Boosting method is that it raises the accuracy of algorithms: it can effectively transform a weak learning algorithm into a strong one. The improved Naïve Bayes algorithm based on the Boosting method is proposed to improve the accuracy of the spam filter. The experimental results show that the improved filtering algorithm reduces information loss and the rate of misclassified mail, and performs better than the traditional Naïve Bayes method.
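     As a rough illustration of boosting a Naïve Bayes classifier, the sketch below uses discrete AdaBoost (Freund and Schapire) with a weighted multinomial Naïve Bayes model as the weak learner. The abstract does not say which Boosting variant or weighting scheme the thesis uses, so the choice of AdaBoost, the label encoding (+1 for spam, -1 for legitimate mail), and all function names are assumptions.

```python
import math

def fit_weighted_nb(docs, labels, weights, vocab):
    """Weak learner: multinomial Naive Bayes in which each training
    document contributes its boosting weight instead of a count of 1."""
    n = len(docs)
    prior = {+1: 0.0, -1: 0.0}
    counts = {+1: dict.fromkeys(vocab, 0.0), -1: dict.fromkeys(vocab, 0.0)}
    for doc, y, w in zip(docs, labels, weights):
        prior[y] += w * n              # rescale so weights behave like counts
        for t in doc:
            counts[y][t] += w * n
    totals = {y: sum(counts[y].values()) for y in (+1, -1)}

    def log_prob(t, y):                # Laplace-smoothed log P(t | y)
        return math.log((counts[y][t] + 1.0) / (totals[y] + len(vocab)))

    def classify(doc):
        score = math.log(prior[+1] + 1e-12) - math.log(prior[-1] + 1e-12)
        score += sum(log_prob(t, +1) - log_prob(t, -1)
                     for t in doc if t in vocab)
        return +1 if score > 0 else -1

    return classify

def adaboost_nb(docs, labels, rounds=5):
    """Train up to `rounds` weighted Naive Bayes learners with AdaBoost."""
    n = len(docs)
    vocab = {t for doc in docs for t in doc}
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        h = fit_weighted_nb(docs, labels, weights, vocab)
        preds = [h(doc) for doc in docs]
        err = sum(w for w, p, y in zip(weights, preds, labels) if p != y)
        if err >= 0.5:                 # no better than chance: stop
            break
        err = max(err, 1e-10)          # guard against log of infinity
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, h))
        if err <= 1e-10:               # training set already fit perfectly
            break
        # Raise the weight of misclassified messages, lower the rest.
        weights = [w * math.exp(-alpha * y * p)
                   for w, p, y in zip(weights, preds, labels)]
        z = sum(weights)
        weights = [w / z for w in weights]
    return ensemble

def predict(ensemble, doc):
    """Weighted vote of the weak learners; +1 means spam."""
    score = sum(alpha * h(doc) for alpha, h in ensemble)
    return +1 if score > 0 else -1
```

     Each round refits Naïve Bayes on reweighted data, so later learners concentrate on the messages earlier ones got wrong. If no learner beats chance, the ensemble is empty and `predict` falls back to -1 (legitimate).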
     4. Design and implementation of an e-mail filtering system based on the improved Bayes algorithms.
     Finally, we put the improved Bayes algorithms to work in an e-mail filtering system. The experimental results show that the algorithms are reliable and valid, and the system achieves satisfactory test results.

