Research on Spam Behavior Pattern Recognition and Filtering Methods
Abstract
E-mail has become one of the most common means of modern communication. However, flaws in SMTP (Simple Mail Transfer Protocol), in particular its lack of any authentication of or control over e-mail senders, have allowed spam to proliferate.
     Spam filtering is a complex problem. Although much research has been done and many results obtained, no existing technique filters all spam perfectly. As camouflage techniques develop, spam has become harder to recognize, which raises the false positive rate of content-based filtering. Content-based filters also spend a large amount of processing time on the many messages that are merely suspected of being spam. New methods and algorithms are therefore needed.
     A framework for a spam filtering system based on data mining of behavior patterns is proposed. Behavior features are extracted from the collected data and divided into session features, message header features, and statistical features. A feature selection algorithm picks the features that effectively predict the class attribute of the training data, and, after data preprocessing, rules for identifying spam behavior are mined from the data.
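The abstract does not name the feature selection algorithm. As an illustration only, the sketch below ranks discrete behavior features by information gain, one common selection criterion. The feature names (`helo_mismatch`, `weekday`) and the toy data are hypothetical and are not taken from the thesis.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(values, labels):
    """Reduction in label entropy obtained by splitting on one feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def rank_features(rows, labels):
    """Return (feature name, gain) pairs, most informative first."""
    gains = {name: information_gain([row[name] for row in rows], labels)
             for name in rows[0]}
    return sorted(gains.items(), key=lambda item: item[1], reverse=True)


# Hypothetical session records: one dict of discrete features per message.
rows = [
    {"helo_mismatch": True, "weekday": "mon"},
    {"helo_mismatch": True, "weekday": "tue"},
    {"helo_mismatch": False, "weekday": "mon"},
    {"helo_mismatch": False, "weekday": "tue"},
]
labels = ["spam", "spam", "ham", "ham"]

ranked = rank_features(rows, labels)
```

In this toy data `helo_mismatch` separates the two classes completely while `weekday` carries no information, so a selection step that keeps only the top-ranked features would retain the former and drop the latter.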
     A multi-level model for mining spam behavior patterns is proposed, in which different pattern mining algorithms are applied to different types of behavior features. For the behavior features of the MTA (Mail Transport Agent) session stage, a decision-tree-based model for recognizing spam sending behavior is proposed. It does not need to receive the entire message; by mining the behavior features exhibited during the mail session, it filters spam early, at the session stage. A histogram distance method is used to detect abnormal user sending behavior. Fingerprint features and statistical features of attachments are computed to build an attachment feature vector, and a Support Vector Machine (SVM) is used to model the attachment behavior of spam. The similarity between URLs (Uniform Resource Locators) is computed and similar URLs are grouped into URL cliques; the minimum distance between a sample and the URL cliques is converted into a confidence value that serves as the classifier output for identifying spam behavior.
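The histogram distance idea for sending behavior can be sketched as follows. The abstract does not say which distance or which histogram bins are used; this example assumes an L1 distance between normalized count histograms, and the bin layout, the sample counts, and the `threshold` value are hypothetical.

```python
def normalize(counts):
    """Turn raw counts into relative frequencies (all zeros if empty)."""
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [c / total for c in counts]


def l1_histogram_distance(h1, h2):
    """L1 distance between two normalized histograms; ranges from 0 to 2."""
    if len(h1) != len(h2):
        raise ValueError("histograms must have the same number of bins")
    p, q = normalize(h1), normalize(h2)
    return sum(abs(a - b) for a, b in zip(p, q))


def is_abnormal(profile, recent, threshold=0.5):
    """Flag a user whose recent histogram drifts far from the profile.

    The threshold is an assumed tuning parameter, not a value from the thesis.
    """
    return l1_histogram_distance(profile, recent) > threshold


# Hypothetical message counts per time-of-day bin for one user.
profile = [10, 10, 0, 0]   # long-term sending habit
usual = [5, 5, 0, 0]       # same shape, lower volume
burst = [0, 0, 20, 0]      # all mail sent in a bin the user never used
```

Normalizing first means the comparison reacts to a change in the shape of the sending pattern rather than to volume alone: `usual` has distance 0 from `profile`, while `burst` reaches the maximum distance of 2.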
     A collaborative filtering model based on a Bayesian algorithm is proposed to correlate the results of the individual models. Because traditional Bayesian spam filtering ignores the losses caused by false positives and false negatives, an improved Bayesian algorithm is proposed. It introduces a loss factor that minimizes the risk of false positives without reducing the accuracy rate. With an appropriate loss factor, both the accuracy rate and the recall rate reach satisfactory levels. A performance comparison of the combined Bayesian model with the attachment model, the user sending behavior model, and the URL model confirms that the improved combined Bayesian model classifies substantially better than any single model.
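One way to read the loss factor is as the standard cost-sensitive decision rule for naive Bayes: a message is labeled spam only when spam is at least `loss_factor` times more probable than legitimate mail. The sketch below combines the binary verdicts of three sub-models under that assumption. The thesis's exact formulation is not given in the abstract, and the conditional probabilities used here are invented for illustration.

```python
import math


def spam_posterior(verdicts, p_given_spam, p_given_ham, prior_spam=0.5):
    """Naive Bayes posterior P(spam | verdicts of the individual models).

    verdicts[i] is 1 if model i flagged the message, else 0.
    p_given_spam[i] / p_given_ham[i] are the probabilities that model i
    flags a spam / legitimate message (assumed to be estimated beforehand).
    """
    log_spam = math.log(prior_spam)
    log_ham = math.log(1.0 - prior_spam)
    for fired, ps, ph in zip(verdicts, p_given_spam, p_given_ham):
        log_spam += math.log(ps if fired else 1.0 - ps)
        log_ham += math.log(ph if fired else 1.0 - ph)
    # Convert the log-odds back into a probability.
    return 1.0 / (1.0 + math.exp(log_ham - log_spam))


def classify(verdicts, p_given_spam, p_given_ham,
             loss_factor=1.0, prior_spam=0.5):
    """Label as spam only if P(spam) / P(ham) exceeds the loss factor."""
    posterior = spam_posterior(verdicts, p_given_spam, p_given_ham, prior_spam)
    threshold = loss_factor / (1.0 + loss_factor)
    return "spam" if posterior > threshold else "ham"


# Hypothetical firing rates for the attachment, sending-behavior and URL models.
P_GIVEN_SPAM = [0.8, 0.7, 0.9]
P_GIVEN_HAM = [0.1, 0.2, 0.05]

# Attachment and URL models flag the message; the sending-behavior model does not.
verdicts = [1, 0, 1]
posterior = spam_posterior(verdicts, P_GIVEN_SPAM, P_GIVEN_HAM)
lenient = classify(verdicts, P_GIVEN_SPAM, P_GIVEN_HAM, loss_factor=1)
strict = classify(verdicts, P_GIVEN_SPAM, P_GIVEN_HAM, loss_factor=99)
```

With these made-up numbers the posterior is about 0.98. A loss factor of 1 (threshold 0.5) labels the message spam, whereas a loss factor of 99 (threshold 0.99) lets it through, trading some recall for a lower false positive risk.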
     A classification method based on fuzzy decision trees is proposed. Because attributes in the real world are not always crisp, attribute membership degrees describe behavior features more naturally and reasonably, so a fuzzy decision tree is more suitable than a crisp decision tree. The fuzzy decision tree algorithm extends the range of application of decision tree learning so that it can handle uncertainty. It deals appropriately with imprecise information during learning and inference, and offers stronger classification ability and robustness. Because it can generate rules at different levels and with different confidence degrees, it gives decision makers rich decision information.
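To make the contrast with a crisp tree concrete, the sketch below evaluates a small hand-built fuzzy tree. It is not the induction algorithm from the thesis: the two attributes, the ramp-shaped membership functions, the leaf certainties, and the use of the product to combine memberships along a path are all assumptions chosen for illustration.

```python
def ramp_up(x, a, b):
    """Membership that rises linearly from 0 at a to 1 at b."""
    if x <= a:
        return 0.0
    if x >= b:
        return 1.0
    return (x - a) / (b - a)


def fuzzify(x, a, b):
    """Split one continuous attribute into complementary fuzzy sets."""
    high = ramp_up(x, a, b)
    return {"low": 1.0 - high, "high": high}


def classify(connections_per_minute, recipients_per_message):
    """Return class confidences from a hypothetical two-level fuzzy tree."""
    rate = fuzzify(connections_per_minute, 5.0, 20.0)
    rcpt = fuzzify(recipients_per_message, 2.0, 10.0)
    # Each leaf: (membership of the sample in the path, class certainties).
    leaves = [
        (rate["low"], {"ham": 0.9, "spam": 0.1}),
        (rate["high"] * rcpt["low"], {"ham": 0.6, "spam": 0.4}),
        (rate["high"] * rcpt["high"], {"ham": 0.05, "spam": 0.95}),
    ]
    scores = {"ham": 0.0, "spam": 0.0}
    for weight, certainty in leaves:
        for label, value in certainty.items():
            scores[label] += weight * value
    return scores


# A borderline connection rate with many recipients per message.
scores = classify(12.5, 50)
```

A crisp tree would send the borderline rate of 12.5 down exactly one branch. Here the sample belongs to both branches with membership 0.5, every leaf contributes in proportion to that membership, and the result is a graded confidence for each class rather than a single hard label.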
     MailGate, a mail filtering system that combines behavior pattern recognition with other filtering techniques, is designed and implemented as a prototype. Experimental results show that MailGate achieves good recall and false positive (FP) rates for spam filtering.