BW-LVQ E-mail Filtering Model

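Author: Wang Ying (王影)
Degree level: Master's
Discipline: Computer System Architecture
Keywords: Spam; Learning Vector Quantization; Black List; White List; Filtering Model
Degree year: 2005
Advisor: Lu Xianliang (卢显良)
Discipline code: 081201
Degree-granting institution: University of Electronic Science and Technology of China (电子科技大学)
Thesis submission date: 2004-12-01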

Abstract

With the spread of the Internet, e-mail has become one of the most widely used communication tools thanks to its speed, convenience, and low cost. Its by-product, spam, however, causes endless trouble for Internet users, network administrators, and Internet service providers (ISPs): it wastes recipients' time, bandwidth, and storage, congests network links, and serves as a carrier for harmful information. The spam filtering methods in mature use today combine automatic software filtering with manual management, but this approach adapts poorly to the diversity of spam and filters out only about 50% of it. More intelligent filtering techniques are therefore urgently needed to deal with the growing spam problem.
    The main goal of this thesis is to explore a specific spam filtering model, then implement and test it. The research examines whether the chosen model is appropriate and how adjusting the model's own parameters and the environmental parameters affects filtering performance, so the experiments must test the model's effectiveness and feasibility thoroughly. The author met these goals during the research.
    This thesis proposes the LVQ (Learning Vector Quantization) e-mail filtering model and the improved BW (Black and White List) e-mail filtering model. It describes the design principles of both models in detail, discusses the relationship between them and their relationship to the mail server, and gives the key implementation framework and code. The LVQ model addresses two problems of Boolean e-mail filtering models: discrete feature terms and the blurred boundary between spam and legitimate mail. The improved BW model refines the traditional black list and white list model and reduces the losses caused when users misclassify borderline addresses.
    Although many spam filtering methods already exist, many spam-related problems still lack good solutions. This significantly degrades the performance of mail filtering systems, so the harm done by spam has not diminished. The new filtering models proposed in this thesis solve some of these problems and can improve filtering performance in certain environments, which makes this research worthwhile.
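
The abstract names Learning Vector Quantization but does not give its update rule. The following is a minimal sketch of the standard LVQ1 algorithm on toy data, not the thesis's implementation. The two-dimensional feature vectors (for example, fraction of suspicious words and fraction of links), the labels, the initial prototypes, and the learning-rate schedule are all illustrative assumptions.

```python
import math


def nearest(prototypes, x):
    """Return the index of the prototype closest to x (Euclidean distance)."""
    return min(range(len(prototypes)),
               key=lambda i: math.dist(prototypes[i][0], x))


def train_lvq1(samples, prototypes, lr=0.3, epochs=20):
    """Standard LVQ1: pull the winning prototype toward a sample when
    their labels match, push it away when they differ."""
    # Copy the prototypes so the caller's initial values are not mutated.
    protos = [(list(w), label) for w, label in prototypes]
    for epoch in range(epochs):
        rate = lr * (1 - epoch / epochs)  # linearly decaying learning rate
        for x, y in samples:
            w, label = protos[nearest(protos, x)]
            sign = 1.0 if label == y else -1.0
            for j in range(len(w)):
                w[j] += sign * rate * (x[j] - w[j])
    return protos


def classify(protos, x):
    """Label a message with the class of its nearest prototype."""
    return protos[nearest(protos, x)][1]


# Hypothetical training data: (feature vector, label).
samples = [
    ((0.9, 0.8), "spam"), ((0.8, 0.9), "spam"), ((0.7, 0.7), "spam"),
    ((0.1, 0.2), "ham"), ((0.2, 0.1), "ham"), ((0.3, 0.2), "ham"),
]
# Hypothetical starting prototypes, one per class.
initial = [((0.6, 0.6), "spam"), ((0.4, 0.4), "ham")]

protos = train_lvq1(samples, initial)
print(classify(protos, (0.85, 0.75)))
print(classify(protos, (0.15, 0.10)))
```

Because each message is compared with continuous-valued prototypes rather than matched against Boolean keyword rules, the decision boundary falls between the learned prototypes, which is the property the abstract credits to the LVQ model.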

