Research on Spam Filtering Based on the Bayesian Algorithm
Abstract
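With the rapid growth of the Internet, e-mail has become a primary means of modern communication. At the same time, spam has spread across the network and causes users considerable trouble. Preventing and controlling spam effectively is therefore a practical problem of real importance.
     This thesis first surveys a large body of Chinese and international anti-spam literature and data, and analyzes and summarizes existing spam filtering techniques. Spam filtering is an important anti-spam measure. Current techniques fall into three main groups: filtering based on security authentication, rule-based filtering, and filtering based on statistical learning. The latter two are both content-based.
     This thesis studies content-based spam filtering algorithms, focusing on the Bayesian algorithm and its classification model. The PG Bayesian algorithm, the GR Bayesian algorithm, and the naive Bayes algorithm are analyzed in detail and compared experimentally, with emphasis on the strengths and weaknesses of naive Bayes in spam filtering. Three improvements address those weaknesses: a feature selection algorithm based on the chi-square distribution is adopted to further improve the accuracy and efficiency of Chinese word segmentation; a minimum-risk factor is introduced to reduce the risk of misclassification in spam detection, which lowers how often users must intervene and improves recognition efficiency; and a cognitive learning algorithm is proposed to strengthen the model's self-learning ability while greatly reducing the difficulty of recognizing spam in a high-dimensional vector space, so that the model achieves better precision and recall.
     Building on the minimum-risk naive Bayes algorithm, this thesis further introduces cognitive learning theory, providing a sound technical solution for spam filtering over high-dimensional vector spaces. Experimental results show that the method further improves the spam recognition rate and, in particular, handles spam filtering in high-dimensional feature vector spaces well, laying a foundation for research on artificial-intelligence-based spam filtering.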
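As an illustration of chi-square feature selection, the sketch below ranks terms by the standard statistic N(AD - CB)^2 / ((A+C)(B+D)(A+B)(C+D)), where A and B count spam and non-spam messages containing the term, and C and D count spam and non-spam messages without it. It is not the thesis's own implementation: the function names, the "spam"/"ham" labels, and the assumption that messages arrive already segmented into tokens are all hypothetical choices made for this example.

```python
from collections import Counter


def chi_square_scores(docs, labels, target="spam"):
    """Score each term by its chi-square statistic against the target class.

    docs is a list of token lists (assumed already segmented),
    labels is a parallel list of class names.
    """
    n = len(docs)
    n_target = sum(1 for y in labels if y == target)
    in_target = Counter()  # A: target-class messages containing the term
    in_other = Counter()   # B: other messages containing the term
    for tokens, y in zip(docs, labels):
        for term in set(tokens):  # document frequency, not raw term counts
            if y == target:
                in_target[term] += 1
            else:
                in_other[term] += 1

    scores = {}
    for term in set(in_target) | set(in_other):
        a = in_target[term]
        b = in_other[term]
        c = n_target - a          # target-class messages without the term
        d = (n - n_target) - b    # other messages without the term
        denom = (a + c) * (b + d) * (a + b) * (c + d)
        scores[term] = n * (a * d - c * b) ** 2 / denom if denom else 0.0
    return scores


def select_features(docs, labels, k, target="spam"):
    """Return the k highest-scoring terms (ties broken alphabetically)."""
    scores = chi_square_scores(docs, labels, target)
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


if __name__ == "__main__":
    docs = [["free", "money"], ["free", "offer"],
            ["meeting", "notes"], ["project", "meeting"]]
    labels = ["spam", "spam", "ham", "ham"]
    print(select_features(docs, labels, k=2))
```

Terms that appear almost exclusively in one class score highest, so keeping only the top k shrinks the feature vector before classification.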
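The minimum-risk idea can be sketched as a naive Bayes classifier whose decision compares expected losses instead of raw posteriors. The abstract does not give the thesis's risk factor or loss values, so the class name, the ham_loss parameter (the assumed cost of flagging a legitimate message, relative to a cost of 1 for missing a spam), and the Laplace smoothing are assumptions made for illustration only.

```python
import math
from collections import Counter


class MinRiskNaiveBayes:
    """Multinomial naive Bayes with a cost-sensitive (minimum-risk) decision."""

    def __init__(self, ham_loss=9.0, alpha=1.0):
        self.ham_loss = ham_loss  # assumed cost of a false positive
        self.alpha = alpha        # Laplace smoothing constant

    def fit(self, docs, labels):
        # Training data must contain both "spam" and "ham" examples.
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.n_docs = Counter(labels)
        for tokens, y in zip(docs, labels):
            self.counts[y].update(tokens)
        self.vocab = set(self.counts["spam"]) | set(self.counts["ham"])
        return self

    def _log_joint(self, tokens, y):
        total = sum(self.counts[y].values()) + self.alpha * len(self.vocab)
        score = math.log(self.n_docs[y] / sum(self.n_docs.values()))
        for t in tokens:
            if t in self.vocab:  # ignore terms never seen in training
                score += math.log((self.counts[y][t] + self.alpha) / total)
        return score

    def spam_posterior(self, tokens):
        ls = self._log_joint(tokens, "spam")
        lh = self._log_joint(tokens, "ham")
        m = max(ls, lh)  # subtract the max for numerical stability
        ps, ph = math.exp(ls - m), math.exp(lh - m)
        return ps / (ps + ph)

    def predict(self, tokens):
        p = self.spam_posterior(tokens)
        risk_if_spam = self.ham_loss * (1.0 - p)  # loss if the mail is ham
        risk_if_ham = 1.0 * p                     # loss if the mail is spam
        return "spam" if risk_if_spam < risk_if_ham else "ham"


if __name__ == "__main__":
    docs = [["free", "money"], ["free", "offer"],
            ["meeting", "notes"], ["project", "meeting"]]
    labels = ["spam", "spam", "ham", "ham"]
    clf = MinRiskNaiveBayes(ham_loss=9.0).fit(docs, labels)
    print(clf.spam_posterior(["free", "money"]), clf.predict(["free", "money"]))
```

With ham_loss = 9 a message is flagged only when its spam posterior exceeds 0.9, so borderline messages are delivered rather than discarded; setting ham_loss = 1 recovers the ordinary maximum-posterior rule.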
With the rapid development of the Internet, e-mail has become a primary means of modern telecommunication. However, spam (also called "junk mail") has simultaneously become widespread online, causing a great deal of trouble for users. It is therefore both important and practical to prevent and control spam effectively.
     This thesis first reviews a considerable body of anti-spam literature and data from China and abroad, and then analyzes and summarizes existing anti-spam techniques. E-mail filtering is an important measure against spam; at present it is mainly based on IP address, on rules, or on statistical learning, and the latter two are content-based.
     The thesis mainly addresses content-based spam filtering algorithms, whose distinguishing feature is text categorization, i.e., preprocessing the text content of a message and then recognizing spam by categorizing that text. The Bayesian algorithm and its categorization model are studied in depth. A detailed analysis and comparative testing of the PG Bayesian algorithm are carried out through experiments, in which the strengths and limitations of the naive Bayesian algorithm in spam filtering are the main focus. To increase the accuracy and efficiency of Chinese word segmentation, a feature selection algorithm based on the chi-square (χ²) statistic is chosen, together with a method of balancing keywords; by introducing minimum risk, the risk of misjudging spam is reduced, which lowers the frequency of user intervention and increases recognition efficiency; and by proposing a cognitive learning algorithm, the self-learning capability of the model is increased and the difficulty of recognizing spam in high-dimensional vector spaces is reduced, so that the model reaches better precision and recall.
     The thesis puts forward a better solution for spam filtering in high-dimensional vector spaces, built on the minimum-risk naive Bayesian algorithm and the introduction of cognitive learning. The experiments show that the method increases the spam recognition rate and, in particular, addresses the problem of spam filtering in high-dimensional feature spaces, contributing to research on artificial-intelligence-based spam filtering.
