Design of a Spam Filtering System Based on Neural Network Ensembles
Abstract
The spread and adoption of the Internet have made e-mail an important means of information exchange, but the spam that has come with it seriously disrupts people's work and daily life, so research on spam filtering technology is of great significance. Existing filtering techniques have many shortcomings and cannot remove spam completely; approaching that ideal requires a more effective filtering technique that raises the accuracy of e-mail classification.
     Ensembling can raise the classification accuracy of a classifier. Among the machine-learning methods currently applied to spam filtering, neural networks are among the more effective, but they tend to fall into local minima, which causes messages to be misclassified. This thesis therefore ensembles them: several different single neural-network classifiers are combined into one classifier whose output is decided jointly by the outputs of the member networks. The goal is to improve the generalization ability of the learning system and thereby the filtering performance of the system, and this is the subject studied here.
     The filtering system designed in this thesis consists of three parts: e-mail preprocessing, feature extraction, and classifier design. Preprocessing converts the messages of a standard e-mail corpus into the vector space model (VSM) representation that a computer can recognize and process easily; feature extraction uses the information gain (IG) criterion to reduce the dimensionality of the data and improve the running efficiency of the algorithm; classifier design uses the neural-network ensemble methods Boosting and Bagging to build the mail classifier, combining the outputs of several single classifiers to decide the class of each message and filter out spam. Experiments were carried out on the PU series of spam corpora. In addition to the traditional evaluation measures, a confusion-matrix evaluation was used; a comparison with a single RBF neural-network classifier shows that the neural-network ensemble filters spam more effectively.
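The preprocessing and feature-extraction steps can be illustrated with a short sketch. The snippet below is not taken from the thesis: the function names, the use of a binarised term-presence matrix, and NumPy itself are illustrative assumptions. It scores every term of a VSM term-document matrix by information gain and keeps the k highest-scoring terms, which is the dimensionality reduction the abstract describes.

```python
import numpy as np

def entropy(p):
    """Entropy (in bits) of a discrete distribution given as probabilities."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def information_gain(X_bin, y):
    """
    IG(t) = H(C) - P(t) * H(C | t present) - P(not t) * H(C | t absent)
    X_bin: (n_docs, n_terms) binary term-presence matrix (the VSM, binarised)
    y    : (n_docs,) integer labels, 1 = spam, 0 = legitimate
    Returns one IG score per term.
    """
    n_docs, n_terms = X_bin.shape
    h_c = entropy(np.bincount(y) / n_docs)          # class entropy H(C)
    scores = np.empty(n_terms)
    for j in range(n_terms):
        present = X_bin[:, j] > 0
        p_t = present.mean()
        h_cond = 0.0
        for mask, p in ((present, p_t), (~present, 1.0 - p_t)):
            if p > 0:
                h_cond += p * entropy(np.bincount(y[mask], minlength=2) / mask.sum())
        scores[j] = h_c - h_cond
    return scores

def select_top_k(X, y, k=500):
    """Indices of the k terms with the highest information gain."""
    scores = information_gain((X > 0).astype(int), np.asarray(y))
    return np.argsort(scores)[::-1][:k]
```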
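The Bagging half of the classifier design can be sketched as follows, assuming scikit-learn's MLPClassifier as a stand-in for the RBF networks actually used in the thesis (an assumption; scikit-learn ships no RBF-network classifier). Each member is trained on a bootstrap resample of the training set, and the ensemble's class is a majority vote over the members, which is one concrete way the member outputs can jointly decide the result.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def train_bagging_ensemble(X, y, n_members=5, seed=0):
    """Train n_members networks, each on a bootstrap resample of (X, y)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    members = []
    for _ in range(n_members):
        idx = rng.integers(0, n, size=n)              # sample n indices with replacement
        clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=300)
        clf.fit(X[idx], y[idx])
        members.append(clf)
    return members

def ensemble_predict(members, X):
    """Majority vote over the members' 0/1 predictions (1 = spam)."""
    votes = np.mean([clf.predict(X) for clf in members], axis=0)
    return (votes >= 0.5).astype(int)

# Typical use, after IG feature selection:
#   members = train_bagging_ensemble(X_train, y_train)
#   y_pred  = ensemble_predict(members, X_test)
```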
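Boosting differs from Bagging in that later members concentrate on the messages earlier members misclassified. Below is a minimal AdaBoost.M1-style sketch, again with MLPClassifier standing in for the thesis's networks; resampling by the example weights replaces true per-sample weighting, since MLPClassifier does not accept sample weights. The exact boosting variant used in the thesis may differ.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def train_boosting_ensemble(X, y, rounds=5, seed=0):
    """AdaBoost.M1-style training on labels in {0, 1}, using weighted resampling."""
    rng = np.random.default_rng(seed)
    n = len(y)
    w = np.full(n, 1.0 / n)                      # example weights, initially uniform
    learners, alphas = [], []
    for _ in range(rounds):
        idx = rng.choice(n, size=n, replace=True, p=w)
        clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=300)
        clf.fit(X[idx], y[idx])
        wrong = clf.predict(X) != y
        err = float(np.sum(w[wrong]))
        if err >= 0.5:                           # no better than chance: stop early
            break
        alpha = 0.5 * np.log((1.0 - err) / max(err, 1e-10))
        w *= np.exp(np.where(wrong, alpha, -alpha))   # up-weight the mistakes
        w /= w.sum()
        learners.append(clf)
        alphas.append(alpha)
    return learners, alphas

def boosting_predict(learners, alphas, X):
    """Alpha-weighted vote; {0,1} predictions are mapped to {-1,+1} before summing."""
    votes = np.zeros(len(X))
    for clf, a in zip(learners, alphas):
        votes += a * (2 * clf.predict(X) - 1)
    return (votes > 0).astype(int)
```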
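For the two-class spam/legitimate problem, the confusion-matrix evaluation mentioned in the abstract reduces to a 2x2 table from which precision, recall, and accuracy follow. The layout and the derived metrics below are illustrative, not the thesis's exact report format.

```python
import numpy as np

def confusion_matrix_report(y_true, y_pred):
    """2x2 confusion matrix for spam filtering, with spam = 1 as the positive class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))   # spam correctly caught
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))   # spam that slipped through
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))   # legitimate mail wrongly filtered
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))   # legitimate mail passed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall    = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy  = (tp + tn) / len(y_true)
    print("              pred spam   pred legit")
    print(f"actual spam   {tp:9d}   {fn:10d}")
    print(f"actual legit  {fp:9d}   {tn:10d}")
    print(f"precision={precision:.3f}  recall={recall:.3f}  accuracy={accuracy:.3f}")
```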