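Design and Implementation of an Email Address Clustering System Based on Data Mining Technology (基于数据挖掘技术的电子邮件地址聚类系统设计与实现)

English title: Design and Implementation on Email Address Clustering System Based on Data Mining Technology
Author: Zhang Dan (张丹)
Degree level: Master's
Discipline: Computer Software and Theory
Chinese keywords: 数据挖掘 ; 电子邮件 ; 聚类 ; 密度
English keywords: Data Mining ; Email ; Clustering ; Density
Degree year: 2007
Advisor: Huang Yongzhong (黄永忠)
Discipline code: 081202
Degree-granting institution: PLA Information Engineering University (解放军信息工程大学)
Thesis submission date: 2007-04-15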

Abstract

Most email processing methods in common use today analyze and filter only the content of individual messages, but message content alone cannot support high-precision classification. How to apply mature data mining techniques to extract useful knowledge and information from massive volumes of email has therefore become a pressing research problem.
     Cluster analysis is an important research direction in data mining. It partitions sample data into classes or clusters such that samples within the same cluster are highly similar to one another, while samples in different clusters differ substantially.
     This thesis describes an email address clustering system based on data mining technology. From the sending and receiving relationships among email addresses, the system constructs similarity-measure attributes for the addresses and applies DBSCAN, a density-based clustering algorithm, to partition the addresses by how closely they are connected and to identify the more active addresses. This narrows the range of addresses that must be examined and makes email analysis more targeted and effective. During email information extraction, the system decodes massive volumes of email and stores the extracted attributes by category. Without altering the original characteristics of the data, it preprocesses the email information by removing duplicates, filling in missing values, pruning, and traversal search, which reduces the data volume as far as possible and addresses the speed problem of processing massive data. In addition, the system provides two kinds of statistics: the number of messages sent and received by a given address, and the contact status of a given address. These offer an intuitive way to analyze regularities in the data and obtain an overview of it, and they also serve as a reference for validating the clustering results.
     Finally, the thesis validates and analyzes the developed system. The results show that, while maintaining a reasonably fast running speed, the system meets its design goals of partitioning email addresses by closeness of contact and of visualizing the statistics on email address information, which confirms the system's effectiveness.
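     As a minimal sketch of the pipeline the abstract outlines (deduplication and pruning, a similarity measure built from sending and receiving relationships, then DBSCAN), the following Python example may help. The abstract does not specify the thesis's actual similarity attributes or parameters, so everything here is an assumption for illustration: the Jaccard distance over contact sets, the function names, the `eps` and `min_pts` values, and the `example.com` addresses.

```python
from collections import defaultdict

NOISE = -1  # label for addresses that belong to no dense group


def build_contacts(messages):
    """Map each address to the set of addresses it exchanged mail with.

    `messages` is a list of (sender, recipient) tuples. Duplicate records
    are dropped and records with a missing field are pruned (assumed
    stand-ins for the preprocessing steps named in the abstract).
    """
    contacts = defaultdict(set)
    for sender, recipient in set(messages):
        if sender and recipient and sender != recipient:
            contacts[sender].add(recipient)
            contacts[recipient].add(sender)
    return dict(contacts)


def jaccard_distance(a, b, contacts):
    """Assumed similarity attribute: overlap of contact sets.

    Each address is added to its own set so that two addresses that
    write to each other already share members.
    """
    na = contacts[a] | {a}
    nb = contacts[b] | {b}
    return 1.0 - len(na & nb) / len(na | nb)


def dbscan(points, eps, min_pts, dist):
    """Plain DBSCAN over arbitrary items with a user-supplied distance."""
    labels = {}
    cluster = 0

    def region(p):
        # All points within eps of p (p itself included).
        return [q for q in points if dist(p, q) <= eps]

    for p in points:
        if p in labels:
            continue
        neighbours = region(p)
        if len(neighbours) < min_pts:
            labels[p] = NOISE  # may later be claimed as a border point
            continue
        labels[p] = cluster
        queue = [q for q in neighbours if q != p]
        while queue:
            q = queue.pop()
            if labels.get(q) == NOISE:
                labels[q] = cluster  # border point of this cluster
            if q in labels:
                continue
            labels[q] = cluster
            q_neighbours = region(q)
            if len(q_neighbours) >= min_pts:  # q is a core point: expand
                queue.extend(q_neighbours)
        cluster += 1
    return labels


if __name__ == "__main__":
    messages = [
        ("a1@example.com", "a2@example.com"),
        ("a2@example.com", "a3@example.com"),
        ("a1@example.com", "a3@example.com"),
        ("a1@example.com", "a2@example.com"),  # duplicate record
        ("b1@example.com", "b2@example.com"),
        ("b2@example.com", "b3@example.com"),
        ("b3@example.com", "b1@example.com"),
        ("x@example.com", "y@example.com"),
    ]
    contacts = build_contacts(messages)
    labels = dbscan(sorted(contacts), eps=0.5, min_pts=3,
                    dist=lambda p, q: jaccard_distance(p, q, contacts))
    for address in sorted(labels):
        print(address, labels[address])
```

     With these illustrative settings, the two tightly connected groups of three addresses form separate clusters, and the isolated pair is labeled as noise because it never reaches `min_pts` neighbours. A real deployment over massive email data would need an indexed neighbourhood search rather than the quadratic scan used in `region`.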
