Design and Implementation of a Spam Filtering System That Adapts to Concept Drift

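Author: 党建军
Degree level: Master's
Discipline: Information and Communication Engineering
Keywords (Chinese): 垃圾邮件 ; 中文分词 ; 朴素贝叶斯算法 ; 概念漂移
Keywords (English): spam ; Chinese word segmentation ; naive Bayes ; concept drift
Degree year: 2010
Advisor: 张凤荔
Discipline code: 081001
Degree-granting institution: University of Electronic Science and Technology of China (电子科技大学)
Thesis submission date: 2010-04-01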

Abstract

Email is a convenient and fast way of exchanging information over the Internet, and it is favored by a growing number of people. The emergence and steady worsening of spam, however, has turned this convenience into a nuisance. Anti-spam technology has become an active research topic in the Internet field, and content-based spam filtering in particular is a mature and effective approach.
     Naive Bayes filtering is one of the effective content-based methods. The features of spam change over time, yet a traditional trained model must be retrained before it can follow new mail features. A traditional naive Bayes filter therefore has to be combined with other techniques to adapt effectively to new features. This thesis proposes an instance selection and weighted classifier ensemble algorithm, which applies data-stream solutions from the data mining field, currently an active research area, to the mail stream. Building on a study of the basic principles of naive Bayes and an analysis of its strengths and weaknesses, and starting from the static nature of traditional classifiers, the method applies the idea of concept drift to a spam filtering system. On the Chinese CCERT "2005-Jul" data set it achieves good results in precision and, more importantly, in adaptability: the filter moves from not adapted to adapted and from low precision to high precision, completing a dynamic adaptation process.
     1) The thesis first analyzes the characteristics of Chinese words and common dictionary structures, explains the basic principles of the naive Bayes algorithm and the basic idea of concept drift, and gives general evaluation criteria for classification algorithms.
     2) Chapter 3 describes the overall goals of the system and the overall architecture of the module, and gives a high-level description of the module.
     3) Chapter 4 explains the detailed design and implementation of each function point inside the module, with pseudocode-level descriptions.
     4) The testing and analysis chapter first describes the Chinese and English corpora and explains in detail the module's parameters and the choice of data sets. It then compares the system with a traditional classifier in precision and adaptability, both when concept drift occurs and when it does not, and analyzes the results in detail.
     In summary, the study of adaptability for traditional spam filtering models proposed here is an attempt with both practical value and theoretical significance.
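
The weighted-ensemble idea named in the abstract can be illustrated with a minimal sketch. It is not the thesis's algorithm: the abstract does not specify the instance-selection step, the weighting rule, or the chunk size, so the sketch omits instance selection and assumes that each labeled chunk of the mail stream trains one naive Bayes member, that older members are re-weighted by their accuracy on the newest chunk, and that the newest member starts with weight 1.0. The names NaiveBayes, WeightedEnsemble, accuracy, and max_members are hypothetical, and the toy messages are already tokenized (no Chinese word segmentation is performed).

```python
import math
from collections import Counter

LABELS = ("ham", "spam")  # ties resolve to the first label, "ham"


class NaiveBayes:
    """Multinomial naive Bayes with Laplace smoothing over token lists."""

    def __init__(self):
        self.word_counts = {label: Counter() for label in LABELS}
        self.doc_counts = Counter()
        self.vocab = set()

    def fit(self, docs):
        for tokens, label in docs:
            self.doc_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, tokens):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = LABELS[0], -math.inf
        for label in LABELS:
            # Smoothed log prior plus smoothed log likelihood of known tokens.
            score = math.log((self.doc_counts[label] + 1) / (total_docs + len(LABELS)))
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for token in tokens:
                if token in self.vocab:  # tokens never seen in training are skipped
                    score += math.log((self.word_counts[label][token] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def accuracy(clf, docs):
    return sum(clf.predict(tokens) == label for tokens, label in docs) / len(docs)


class WeightedEnsemble:
    """Chunk-based ensemble; weights follow accuracy on the newest chunk (assumption)."""

    def __init__(self, max_members=3):
        self.members = []  # list of [classifier, weight]
        self.max_members = max_members

    def update(self, chunk):
        for member in self.members:
            member[1] = accuracy(member[0], chunk)  # re-weight old members
        self.members.append([NaiveBayes().fit(chunk), 1.0])  # newest member
        self.members.sort(key=lambda m: m[1], reverse=True)
        del self.members[self.max_members:]  # drop the weakest members

    def predict(self, tokens):
        if not self.members:
            return LABELS[0]
        votes = Counter()
        for clf, weight in self.members:
            votes[clf.predict(tokens)] += weight
        return votes.most_common(1)[0][0]


# Toy stream: in the second chunk "meeting" drifts from a ham cue to a spam cue.
chunk_old = [(["cheap", "pills"], "spam"), (["cheap", "pills", "offer"], "spam"),
             (["meeting", "report"], "ham"), (["project", "report", "meeting"], "ham")]
chunk_new = [(["meeting", "invite", "prize"], "spam"), (["prize", "meeting"], "spam"),
             (["cheap", "flights", "project"], "ham"), (["cheap", "flights"], "ham")]

ensemble = WeightedEnsemble()
ensemble.update(chunk_old)
before = ensemble.predict(["prize", "meeting"])
ensemble.update(chunk_new)
after = ensemble.predict(["prize", "meeting"])
print(before, after)
```

On this toy stream the member trained on the old chunk misclassifies every message in the new chunk, so its weight falls to zero and the ensemble's decision for the drifted message follows the newest member. A single static classifier trained only on the old chunk would keep its original decision until it was retrained, which is the limitation the abstract attributes to traditional models.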

