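Research on Spam Filtering Technology Based on Fuzzy Support Vector Machines (基于模糊支持向量机的垃圾邮件过滤技术研究)

English title: The Research of Spam Filtering Technology Based on Fuzzy Support Vector Machines
Author: Zhao Haitao (赵海涛)
Degree level: Master's
Discipline: Computer Software and Theory
Chinese keywords: 垃圾邮件; 支持向量机; 模糊隶属度; 模糊支持向量机; 误分损失
English keywords: Spam; SVM; Fuzzy Membership; FSVM; Loss of Misclassification
Degree year: 2010
Advisor: Wei Yan (魏延)
Discipline code: 081202
Degree-granting institution: Chongqing Normal University (重庆师范大学)
Thesis submission date: 2010-03-01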

Abstract

With the rapid development of the Internet, e-mail has become a widely used means of modern communication. While people enjoy the convenience that e-mail brings, they are also harassed by large volumes of spam. The support vector machine (SVM) is a machine learning method founded on statistical learning theory. Because it is based on structural risk minimization, it works well with small samples, generalizes well, and yields a globally optimal solution. SVMs have been applied successfully in many fields and have become a research focus in spam filtering.
     This thesis studies how e-mail works, e-mail formats, and e-mail preprocessing techniques in order to obtain the vector representation of a message that filtering requires. It then concentrates on support vector machines and fuzzy support vector machines (FSVM), introduces FSVM into spam filtering, designs a new fuzzy membership function, and, because misclassifying legitimate mail has serious consequences, introduces different penalty parameters C. Finally, it proposes an FSVM spam filtering method based on misclassification loss and evaluates it in simulation experiments.
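
As an illustration of the two ideas in the paragraph above, the sketch below computes a class-center fuzzy membership and scales a class-specific penalty by it. The abstract does not give the thesis's own membership formula, so this uses the commonly cited class-center form s_i = 1 - d_i / (r + delta) as a stand-in. The function names, the value of delta, the penalty values C_LEGIT and C_SPAM, and the toy vectors are all assumptions made for the example.

```python
import math

C_LEGIT = 10.0  # assumed penalty for misclassifying legitimate mail
C_SPAM = 1.0    # assumed, smaller penalty for misclassifying spam

def class_center(vectors):
    """Component-wise mean of the vectors of one class."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def center_memberships(vectors, delta=0.1):
    """s_i = 1 - d_i / (r + delta): d_i is the distance of sample i to its
    class center, r is the largest such distance in the class."""
    center = class_center(vectors)
    distances = [math.dist(v, center) for v in vectors]
    radius = max(distances)
    return [1.0 - d / (radius + delta) for d in distances]

def effective_penalties(vectors, penalty, delta=0.1):
    """Per-sample penalty s_i * C that weights the slack of sample i."""
    return [s * penalty for s in center_memberships(vectors, delta)]

# Toy feature vectors, made up for illustration.
legit = [[0.0, 0.0], [2.0, 0.0], [1.0, 3.0]]
spam = [[5.0, 5.0], [7.0, 5.0], [6.0, 5.0]]

legit_s = center_memberships(legit)
legit_c = effective_penalties(legit, C_LEGIT)
spam_c = effective_penalties(spam, C_SPAM)
print(legit_s, legit_c, spam_c)
```

Samples far from their class center receive a small membership and therefore a small effective penalty, so they pull less on the decision boundary. Trainers that accept per-sample weights (for example the `sample_weight` argument of scikit-learn's `SVC.fit`) could consume such values; whether the thesis took that route is not stated.
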
     The main research contents and innovations of this thesis are as follows:
     1) The working principle of e-mail, the related mail protocols, and e-mail preprocessing techniques are studied, with emphasis on feature extraction and the vector representation of messages: Chinese word segmentation of the message text combines forward maximum matching with reverse maximum matching, features are selected by document frequency, and a vector space model is built with the TF-IDF function.
     2) SVM-based mail filtering is studied and compared with other mail filtering techniques. SVM-based spam filtering works with small samples, generalizes well, and is globally optimal, but it has two obvious shortcomings. First, mail classification is in essence a problem of processing uncertain information, yet SVM treats it as a deterministic one. Second, an SVM-based method is equally likely to misclassify legitimate mail and spam, which ignores the fact that misclassifying a legitimate message is more serious than misclassifying spam.
     3) FSVM is introduced into spam filtering, with emphasis on its membership function and penalty factor. A new fuzzy membership function based on class centers is designed, and an FSVM spam filtering method based on misclassification loss is proposed.
     4) More appropriate evaluation measures for mail filtering, such as LP, LR, and WR, are studied and designed. The recall of legitimate mail (LR) and other composite measures serve as the main evaluation criteria, and simulation experiments compare the filtering performance of the proposed FSVM method with that of SVM.
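
The preprocessing steps in item 1 can be sketched in plain Python as follows. This is a minimal illustration, not the thesis's implementation: the dictionary, the rule for choosing between the forward and reverse segmentations (fewer words, then fewer single characters, then the reverse result), the `min_df` threshold, and the raw-count times log(N/df) weighting are all assumptions chosen for the example.

```python
import math
from collections import Counter

def forward_max_match(text, vocab, max_len=4):
    """Scan left to right, always taking the longest dictionary word."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:  # a single character is the fallback
                words.append(piece)
                i += size
                break
    return words

def reverse_max_match(text, vocab, max_len=4):
    """Scan right to left, always taking the longest dictionary word."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in vocab:
                words.append(piece)
                j -= size
                break
    return words[::-1]

def segment(text, vocab, max_len=4):
    """Combine both scans with an assumed tie-break rule."""
    fwd = forward_max_match(text, vocab, max_len)
    rev = reverse_max_match(text, vocab, max_len)
    def cost(ws):
        return (len(ws), sum(len(w) == 1 for w in ws))
    return fwd if cost(fwd) < cost(rev) else rev

def select_by_df(docs, min_df=2):
    """Keep terms that occur in at least min_df documents."""
    df = Counter(term for doc in docs for term in set(doc))
    features = sorted(t for t, n in df.items() if n >= min_df)
    return features, df

def tfidf_vectors(docs, features, df):
    """One vector per document: raw term count times log(N / df)."""
    n_docs = len(docs)
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append([tf[t] * math.log(n_docs / df[t]) for t in features])
    return vectors

# Hypothetical dictionary and messages.
vocab = {"研究", "研究生", "生命", "起源", "免费", "发票"}
texts = ["研究生命起源", "免费发票", "免费研究"]

docs = [segment(t, vocab) for t in texts]
features, df = select_by_df(docs, min_df=2)
vectors = tfidf_vectors(docs, features, df)
print(docs, features, vectors)
```

On the first message the forward scan greedily takes 研究生 and strands a single character, while the reverse scan yields three dictionary words, so the combined rule keeps the reverse result. Terms seen in only one message are dropped by the document-frequency filter before the vectors are built.
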
     The simulation results show that the FSVM spam filtering method that accounts for misclassification loss maintains a high spam interception rate while also keeping a high recall of legitimate mail. It effectively addresses the serious consequences of misclassifying legitimate mail, which demonstrates the feasibility and effectiveness of the proposed method.
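
The two rates named in this conclusion can be computed from lists of true and predicted labels, as in the sketch below. LR follows the abstract's definition (recall of legitimate mail), and the spam interception rate is taken to be spam recall. Reading LP as the precision of legitimate mail is an assumption, and WR is omitted because the abstract does not define it. The label strings, the function name, and the sample labels are hypothetical.

```python
def filter_rates(actual, predicted, legit="legit", spam="spam"):
    """Return LR, LP (assumed meaning) and the spam interception rate."""
    pairs = list(zip(actual, predicted))
    legit_total = sum(a == legit for a, _ in pairs)
    spam_total = sum(a == spam for a, _ in pairs)
    legit_kept = sum(a == legit and p == legit for a, p in pairs)
    spam_caught = sum(a == spam and p == spam for a, p in pairs)
    passed = sum(p == legit for _, p in pairs)  # everything delivered to the inbox
    return {
        # share of legitimate mail that reaches the inbox
        "LR": legit_kept / legit_total if legit_total else 0.0,
        # share of delivered mail that really is legitimate (assumed reading of LP)
        "LP": legit_kept / passed if passed else 0.0,
        # share of spam that the filter blocks
        "spam_interception": spam_caught / spam_total if spam_total else 0.0,
    }

# Made-up labels: every legitimate message is kept, half of the spam is blocked.
actual = ["legit"] * 4 + ["spam"] * 4
predicted = ["legit", "legit", "legit", "legit", "spam", "spam", "legit", "legit"]

rates = filter_rates(actual, predicted)
print(rates)
```

A filter tuned for the cost asymmetry described in the abstract would aim to keep LR near 1 first and only then raise the spam interception rate.
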

