Design and Implementation of a Netfilter-Based Spam Filtering Gateway

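Author: Zhang Yi (张益)
Degree level: Master's
Discipline: Computer Applications
Chinese keywords: 垃圾邮件 (spam); 邮件过滤 (mail filtering); 透明网关 (transparent gateway); Netfilter
English keywords: Spam; trash mail; Spam filtering; transparent network gateway; Netfilter
Degree year: 2006
Advisor: Qin Zhiguang (秦志光)
Discipline code: 081203
Degree-granting institution: University of Electronic Science and Technology of China (电子科技大学)
Submission date: 2006-05-01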

Abstract

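With the rapid growth of Internet applications, email has come into ever wider use and has brought great convenience to people's work and daily life. At the same time, large volumes of commercial, social, and political spam have become a growing headache for email users. How to filter out the various kinds of spam effectively has become a topic of concern for many researchers.
     Most current mail filtering approaches fall roughly into two categories: filtering at the mail client and filtering at the mail server. As mail traffic and user numbers grow rapidly, filtering systems integrated into the mail server increasingly show their drawback: they consume a large share of server resources and interfere with normal mail service. For a small mail server (such as one used inside a company), current filtering solutions are also complicated to configure and more heavyweight than the job requires.
     This work originates from the Huawei university fund project "P2P-Based Spam Filtering Gateway" (基于P2P模式的垃圾邮件过滤网关). The thesis first introduces several traditional mail filtering techniques, such as real-time blacklists and whitelists, reverse DNS lookup, Bayesian filtering, and rule-based filtering, and summarizes their strengths and weaknesses. On this basis it proposes a spam filtering gateway model built on the Netfilter framework, whose basic characteristics are that the filtering gateway is separate from the mail server, is transparent to users, and is simple to configure. The model combines several techniques and is divided into the following modules: packet redirection, protocol analysis, attack protection, mail header analysis, rule-based filtering, Bayesian filtering, virus scanning, and query management. The first two together are also called the mail capture engine subsystem.
     The thesis then describes the author's main work in detail: implementing the mail capture engine of the transparent gateway, which covers setting up the transparent gateway, capturing packets, reassembling mail, and implementing a double-buffered, reentrant mail queue, as well as providing an interface for the spam judgment subsystem. The implementation uses the Netfilter/Iptables system, a "hold back the last packet" algorithm designed to redirect packets, a simulated, stripped-down protocol stack dedicated to SMTP, and double buffering and multithreading in the mail queue. It then gives a comparatively brief account of the design and implementation principles of the spam judgment subsystem, which comprises mail header analysis, Bayesian filtering, and rule-based filtering. Finally, it presents the test methods and the results of tests run on the laboratory intranet, which show that the system design is sound and feasible, and offers suggestions for future improvement and extension.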
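     The double-buffered mail queue mentioned above can be illustrated with a minimal sketch. It is not the thesis's implementation: the class name `DoubleBufferedQueue`, the methods `put` and `drain`, and the choice of several producer threads with a single consumer are all assumptions made for illustration. Producers append reassembled mails to a write buffer, and the consumer swaps the two buffers under a lock and then processes its batch without holding the lock.

```python
import threading


class DoubleBufferedQueue:
    """Hypothetical double-buffered queue: many producers, one consumer."""

    def __init__(self):
        self._write = []                      # producers append here
        self._read = []                       # only the consumer touches this
        self._cond = threading.Condition(threading.Lock())

    def put(self, mail):
        """Called by capture threads; safe to call concurrently."""
        with self._cond:
            self._write.append(mail)
            self._cond.notify()

    def drain(self, timeout=None):
        """Called by the single consumer; returns the next batch of mails."""
        with self._cond:
            # Wait until something has been queued or the timeout expires.
            self._cond.wait_for(lambda: self._write, timeout)
            # Swap the buffers; this is the only work done under the lock.
            self._write, self._read = self._read, self._write
        batch = list(self._read)              # copy out, then reuse the buffer
        self._read.clear()
        return batch
```

     Because the swap is the only step performed under the lock, producers are blocked only briefly, however long the consumer takes to judge the mails in one batch.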
With the boom of the Internet, email has come into wide use and brings great convenience to our work and life. At the same time, more and more spam (or trash mail) has become a serious problem for email users. How to filter all types of spam effectively is now a hot research topic.
     Most spam filtering is done in one of two ways: client-side filtering or server-side filtering. Present spam-filtering systems that use server-side filtering have two flaws. First, filtering consumes too many server resources, which affects the normal email service. Second, they are complicated to configure and use in a smaller network, for example a company intranet.
     The first part of this thesis introduces some traditional spam-filtering techniques, including real-time blacklists, reverse DNS lookup, Bayesian filtering, and rule-based filtering. Based on an analysis of their features and deficiencies, a spam-filtering model built on the Netfilter framework is proposed. The filter is separate from the email server, transparent to email users, and easy to configure and use. It consists of the following modules: IP packet redirection, protocol analysis, attack protection, mail header analysis, Bayesian filtering, rule-based filtering, virus checking, and query management. The first two together are also called the email capture engine.
     The following part of this thesis describes how to design and implement a high-speed email capture engine, which is the main part of my work. It includes constructing a transparent network gateway, capturing IP packets, restoring SMTP emails, and implementing a reentrant mail queue with two buffers. Afterwards, a relatively brief introduction to the spam judgment subsystem is given, covering mail header analysis, Bayesian filtering, and rule-based filtering.
     Finally, approaches to testing this spam-filtering system are described, and successful tests carried out on our lab's intranet are reported. Suggestions for improving the system are also given.
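     As an illustration of the Bayesian filtering named in the abstract, the sketch below scores a message with a simple naive Bayes model over the tokens present in it. It is a generic textbook formulation, not the classifier built in the thesis: the function names `train` and `spam_probability`, whitespace tokenization, and add-one smoothing are assumptions, and Chinese mail would need a word segmenter in place of `split()`.

```python
import math
from collections import Counter


def train(spam_docs, ham_docs):
    """Count, per token, how many spam and ham messages contain it."""
    spam = Counter(t for d in spam_docs for t in set(d.lower().split()))
    ham = Counter(t for d in ham_docs for t in set(d.lower().split()))
    return spam, ham, len(spam_docs), len(ham_docs)


def spam_probability(text, model):
    """Return P(spam | tokens in text) under the naive independence assumption."""
    spam, ham, n_spam, n_ham = model
    log_odds = math.log(n_spam / n_ham)            # prior odds of spam
    for tok in set(text.lower().split()):
        p_spam = (spam[tok] + 1) / (n_spam + 2)    # add-one smoothing
        p_ham = (ham[tok] + 1) / (n_ham + 2)
        log_odds += math.log(p_spam / p_ham)
    # Numerically stable logistic function.
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    e = math.exp(log_odds)
    return e / (1.0 + e)


model = train(
    ["cheap pills buy now", "buy cheap watches now"],
    ["meeting agenda for monday", "project report attached"],
)
print(spam_probability("buy cheap pills", model))         # above 0.5
print(spam_probability("monday project meeting", model))  # below 0.5
```

     A gateway would compare the returned probability with a threshold chosen from test data; the threshold and the training corpus are deployment decisions that this sketch leaves open.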

