Research on the Filtering Technology of Internet Information Based on RS Theory and SVM

Author: Liu Yang (刘杨)
Degree level: Master's
Discipline: Computer Application Technology
Keywords: RS theory; binary tree multiclass SVM; Web information filtering; email filtering
Degree year: 2008
Advisor: Yi Zhi'an (衣治安)
Discipline code: 081203
Degree-granting institution: Daqing Petroleum Institute (大庆石油学院)
Submission date: 2008-03-26

Abstract

With the rapid development of the Internet, people have gained access to abundant information. However, harmful content has proliferated as well: reactionary, pornographic, and violent material in particular seriously endangers social stability and people's physical and mental health, and network "junk" has invaded everyday life. How to filter out information irrelevant to one's needs, and how to obtain the required information quickly and accurately while staying free of illegal content, has become a research hotspot in Internet development.
     This thesis proposes a new approach to network information filtering that combines RS (rough set) theory with a binary tree multiclass SVM algorithm. An improved heuristic relative attribute reduction, together with value reduction, eliminates redundant attributes and values. On the transformed data table, a statistical rough set algorithm with a relaxation factor generates the decision rules. The mined rules are more concise and more reliable, and the algorithm avoids generating rules by chance, which lowers the misclassification rate. The SVM is then trained with the binary tree multiclass SVM algorithm, which converts the multiclass problem into a series of binary classifications. The algorithm clusters first and classifies afterwards: it computes the maximum similarity between a test sample and the subclass centers, and the degree of separation between subclasses, in order to construct the optimal separating hyperplane at each decision node. A C-class problem needs only C - 1 decision functions, which saves training time. Experiments show that combining RS theory with the binary tree multiclass SVM reduces the complexity of the trained model, thereby reducing overfitting to some extent, improves the generalization ability and training speed of the SVM, and achieves good filtering results.
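     The attribute-reduction step can be illustrated with a minimal Python sketch. It is the textbook greedy reduct driven by the rough-set dependency degree, not the improved heuristic developed in the thesis. The toy decision table and its attribute names (`has_url`, `caps_subject`, `known_sender`) are hypothetical.

```python
from collections import defaultdict

def dependency(rows, attrs, decision):
    """Rough-set dependency degree: share of rows whose equivalence class
    (rows agreeing on every attribute in attrs) has a single decision value."""
    blocks = defaultdict(list)
    for row in rows:
        blocks[tuple(row[a] for a in attrs)].append(row[decision])
    consistent = sum(len(b) for b in blocks.values() if len(set(b)) == 1)
    return consistent / len(rows)

def greedy_reduct(rows, cond_attrs, decision):
    """Add the attribute that raises the dependency most, until the subset
    discerns the decision as well as the full condition set does."""
    target = dependency(rows, cond_attrs, decision)
    reduct = []
    while dependency(rows, reduct, decision) < target:
        rest = [a for a in cond_attrs if a not in reduct]
        best = max(rest, key=lambda a: dependency(rows, reduct + [a], decision))
        reduct.append(best)
    return reduct

# Hypothetical toy decision table: three condition attributes, one decision.
TABLE = [
    {"has_url": 1, "caps_subject": 1, "known_sender": 0, "label": "spam"},
    {"has_url": 1, "caps_subject": 0, "known_sender": 0, "label": "spam"},
    {"has_url": 0, "caps_subject": 1, "known_sender": 1, "label": "ham"},
    {"has_url": 0, "caps_subject": 0, "known_sender": 1, "label": "ham"},
    {"has_url": 1, "caps_subject": 0, "known_sender": 1, "label": "ham"},
    {"has_url": 0, "caps_subject": 1, "known_sender": 0, "label": "spam"},
]
CONDITIONS = ["has_url", "caps_subject", "known_sender"]

if __name__ == "__main__":
    reduct = greedy_reduct(TABLE, CONDITIONS, "label")
    print(reduct, dependency(TABLE, reduct, "label"))
```

     In this toy table the label is fully determined by `known_sender`, so the other two attributes are redundant and the reduct keeps a single column. A smaller attribute set means a lower-dimensional input for the classifier that follows.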
     This thesis implements an intelligent mail filtering system that resides on the mail client side, learns from existing mail, and automatically classifies and filters newly arrived mail. The system is a filtering layer based on the POP3 and SMTP protocols that sits between the user's mail server and the mail-receiving software. Filtering is carried out in two levels. The first level, applied after a message is retrieved, filters on the content of the mail header, then decomposes the message, analyzes its content, extracts features, and forms a feature vector. The main part of the second level is a multiclass filter based on the binary tree SVM, with the radial basis function as the kernel function. Finally, the system is tested on a large number of emails: the evaluation measures for mail filtering are computed and compared with those of several other filtering methods, namely Naive Bayes, KNN, and Boosting Trees. The experimental results show that the system can monitor mail in real time and update its filtering module automatically, making mail filtering more efficient and more accurate.
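     A rough sketch of the data path described above, using only the Python standard library: a header check, a bag-of-words feature vector, and the radial basis function kernel K(x, y) = exp(-gamma * ||x - y||^2) that a second-level SVM would evaluate. The blocked-sender list, the vocabulary, the default `gamma`, and all function names are illustrative assumptions, not details taken from the thesis.

```python
import math
import re
from email import message_from_string
from email.utils import parseaddr

BLOCKED_SENDERS = {"promo@bulk.example"}             # assumed first-level rule
VOCABULARY = ["free", "offer", "meeting", "report"]  # assumed feature terms

def header_rejects(msg):
    """First level: reject on header content alone (here, the From address)."""
    _, address = parseaddr(msg.get("From", ""))
    return address.lower() in BLOCKED_SENDERS

def feature_vector(msg):
    """Count vocabulary terms in the subject and body of a plain-text message."""
    text = f"{msg.get('Subject', '')} {msg.get_payload()}".lower()
    words = re.findall(r"[a-z]+", text)
    return [float(words.count(term)) for term in VOCABULARY]

def rbf_kernel(x, y, gamma=0.5):
    """Radial basis function kernel between two feature vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

RAW = (
    "From: Alice <alice@corp.example>\n"
    "Subject: Free offer\n"
    "\n"
    "A free report for the meeting.\n"
)

if __name__ == "__main__":
    message = message_from_string(RAW)
    if not header_rejects(message):
        vector = feature_vector(message)
        print(vector, rbf_kernel(vector, [0.0, 0.0, 1.0, 1.0]))
```

     Only messages that pass the header check reach feature extraction, which mirrors the two-level split: cheap header rules first, the kernel-based classifier second.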
     In email filtering, because the URL addresses contained in spam are obtained through authorization, this thesis adopts a URL-based method of spam filtering. By capturing the URL information contained in spam, this method filters spam that contains URLs quickly and effectively, which other filtering methods find difficult to do.
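     As an illustration only, URL-based filtering can be sketched as extracting every URL from the message text and checking its host against a set of hosts already seen in spam. The regular expression, the `SPAM_HOSTS` set, and the function names are assumptions; the abstract does not say how the thesis captures or matches URLs.

```python
import re
from urllib.parse import urlparse

URL_PATTERN = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)
SPAM_HOSTS = {"pills.example", "win-prize.example"}  # hypothetical blocklist

def extract_hosts(text):
    """Return the lower-cased host names of all http(s) URLs found in text."""
    hosts = set()
    for url in URL_PATTERN.findall(text):
        # Drop sentence punctuation that the pattern may have swallowed.
        host = urlparse(url.rstrip(".,;:!?)")).hostname
        if host:
            hosts.add(host)
    return hosts

def is_url_spam(text):
    """Flag the message if any URL host appears in the blocklist."""
    return not extract_hosts(text).isdisjoint(SPAM_HOSTS)

if __name__ == "__main__":
    body = "Visit http://PILLS.example/buy?id=1, or https://news.example."
    print(sorted(extract_hosts(body)), is_url_spam(body))
```

     The check is a set lookup per URL, so its cost does not depend on the length of the message body beyond the single regular-expression scan.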

