基于EVS相似度的邮件社区划分方法研究

作者：王芳
论文级别：硕士
学科专业名称：计算机软件与理论
中文关键词：数据挖掘 ; 邮件社区发现 ; 邻近性度量 ; EVS相似度 ; 复杂社会网络
英文关键词：data mining ; e-mail community detection ; proximity measure ; EVS ; complex social network
学位年度：2010
导师：范明
学科代码：081202
学位授予单位：郑州大学
论文提交日期：2010-05-01

摘要

近年来,复杂网络中社区结构的发现及社会关系知识的挖掘,已经成为数据挖掘领域的研究热点之一。电子邮件系统中的邮件通信网络是一种较简单的社会网络,其社区划分问题本质上可以归结为稀疏图的聚类问题。聚类方法的核心是邻近性度量,因此发掘新的更加有效的邻近性度量方法进而提高邮件社区的划分质量,对以后的垃圾邮件的识别与过滤以及大型复杂网络的研究,具有非常重要的意义。
     本文以网络社区为背景,对邮件通信网络中的社区进行了重点研究,主要工作如下：
     (1)提出了一种新的邻近性度量方法EVS,用于指导邮件社区聚类。通过学习和研究各种邻近性度量方法以及国内外复杂网络社区挖掘的相关方法,论文将邮件社区划分转化为图的聚类。首先介绍了邮件特征的向量表示形式、构建了邮件特征矩阵。在此基础上,使用变形后的极值分布函数模型拟合邮件间通信特征信息,然后在转换后的信息矩阵上构建EVS。
     (2)结合微聚类-宏聚类的技术提出了基于EVS相似度的邮件社区聚类算法,验证了EVS的有效性。本文将余弦、皮尔森等经典的相似性度量方法引入邮件社区划分中,用于进行对比分析,并且从具体邮件社区的特点来评估邮件社区的划分质量。
     (3)实验结果表明,在实际的测试数据集上,基于EVS度量的邮件社区聚类算法比基于余弦、皮尔森相似性的邮件聚类方法更加有效,更能够发现高质量的社区。
     本文的研究具有很强的实际应用价值,对垃圾邮件的识别与过滤技术的进一步发展,大型复杂社会网络的社区发掘以及一些商业应用,都有十分重要的意义。
Nowadays, the community detection in complex networks and the knowledge mining of social relations has become one of the hot spots in the area of Data Mining. Email communication network, which is a simpler social network, belongs to the clustering of a sparse graph in nature. The key to the problem of clustering is searching for effective proximity measurement between objects. Therefore, it is helpful of detecting and constructing new similarity measure to improve the quality of community partition. What we have done will be important to recognize and filter spam and do the research on complex networks.
     In contact with web community, the thesis explores the community of mail communication network in depth. The main contributions are as follows:
     (1) Propose a new proximity measurement method, EVS (Extreme Value distribution Similarity), for the email community clustering. After analyzing various kinds of similarity measures and the research of the web community both at home and broad, we transform the problem of the mail community partition as a graph clustering. This paper firstly introduces the email feature vector to construct the email feature matrix and then models the information of email features using the transformed Extreme value distribution. Based on this, we construct EVS.
     (2) To validate EVS, the thesis proposes a new mail community clustering algorithm based on EVS in combination with micro-macro clustering technique. In addition, we induce Cosine-based Similarity and Pearson Correlation Coefficient to email community partition problem. Then, based on the experimental comparison between them, we evaluate the quality of email community according to the specific characteristics of e-mail community.
     (3) The experiments show that, comparing to Cosine-based Similarity and Pearson Correlation Coefficient, the algorithm based on EVS is more competitive for detecting email community.
     This research has high practical. It is important and useful to the development of spam detecting technique, the community detection of large complex social networks and some other commercial activities.

引文

[1]Kumar, Raghavan, Rajagopalan, et al. The Web as a Graph[C]. In Proceedings of the 18th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, Pennsylvania:ACM Press,1999:109-118.
    [2]M. E. J. Newman, M. Girvan. Finding and evaluating community structure in networks[J]. Physical Review E69,026113,2004.
    [3]F. Radicchi, C. Castellano, F. Cecconi, etc al.. Defining and Identifying Communities in Networks[C]. In:Proceedings of the National Academy of Sciences[C].2004.101(9): 2658-2663.
    [4]Prof. Yanchun Zhang. Web Communities Analysis and Construction[R]. VLDB School Lecture,2005:56-92.
    [5]G. W. Flake, S. Lawrence, C. L. Giles. Efficient Identification of Web Communities[C]. Conference on Knowledge Discovery in Data, In proceedings of the 6th ACM SIGKDD international conference on knowledge discovery and data mining,2000:150-160.
    [6]Dom, Gibson, J. Kleinberg, et al. Automatic resource compilation by analyzing hyperlink structure and associated text[C]. In:Proceedings of the 7th ACM-WWW International Conference. Brisbane:ACM Press,1998:65-74.
    [7]Yan Gao, Shiwen Gu, Jin Tang. Research on Technologies of Discovering Web Community Based on Hyperlink Analysis [J]. Application Research of Computers,2006,23(7):183-185.
    [8]J. Dean, M. R. Henzinger. Finding related pages in the World Wide Web[J]. Computer Networks,1999,31(11-16):1467-1479.
    [9]G. Flake, R. Tarjan and K. Tsioutsiouliklis. Graph clustering and mining cut trees[J]. Internet Mathematics,2004.1(3):355-378.
    [10]Hidehiko Ino, Mineichi Kudo, Atsuyoshi Nakamura. Partitioning of Web graphs by community topology [C]. In Proceedings of the 14th International Conference on World Wide Web (Chiba, Japan, May 10-14,2005). WWW'05, ACM Press, New York, NY,661-669.
    [11]Yan Liu, QingXian Wang, Qiang Wang, Qing Yao, Yao Liu. Email Community Detection Using Artificial Ant Colony Clustering[C].2007.1-10.
    [12]Nicolas Labroche, Nicolas Monmarche, Gilles Venturini. A new clustering algorithm based on the chemical recognition system of ants[C]. In:Proceedings of 15th European Conference on Artificial Intelligence (ECAI 2002), Lyon France,2002:345-349.
    [13]M. E. J. Newman. The structure of scientific collaboration networks[C]. Proc. Natl. Acad. Sci. USA98,2001:404-409.
    [14]H. Small. Co-citation in the scientific literature:A new measure of the relationship between two documents[J]. Journal of the American Society for Information Science,1973,24(4): 265-269.
    [15]M. Girvan, M. E. J. Newman. Community structure in social and biological networks[C]. In: Proceedings of the National Academy of Sciences of the United States of America. USA, 2002,99(12):7821-7826.
    [16]Shaoyu CHEN, Jiaxing SONG, Weidong LIU. Relation Grid:Small-World based Social Relationship Network[J]. Application Research Of Computers,2006,23(5):194-197.
    [17]Shlomo Hershkop. Behavior-based Email Analysis with Application to Spam Detection[D]. Graduate School of Arts and Sciences,2006:40-77.
    [18]Fulu Li, Mo-Han Hsieh. An Empirical Study of Clustering Behavior of Spammers and Group-based Anti-Spam Strategies[C]. Proceedings of the 40th Annual Hawaii International Conference on System Sciences,2007.
    [19]Wen-Feng Hsiaoa, Te-Ming Changb, Guo-Hsin Hua. A Cluster-based Approach to Filtering Spam under Skewed Class Distributions[C]. Proceedings of the 40th Hawaii International Conference on System Sciences,2007.
    [20]Jiuchang Wei, Dingtao Zhao, Research on the Crisis Information Communication Model and its Impact Factors[J]. Information Science,2006,24(12):1782-1785.
    [21]D. J. Watts, S. H. Strogatz. Collective dynamics of small world network[J]. Nature,1998: 393-442.
    [22]黄萍,张许杰等.小世界网络的研究现状与展望[J].情报技术,2007,26(4)：2-3.
    [23]Fan Chung, Linyuan Lu. The Small World Phenomenon in Hybrid Power Law Graphs[J]. Notes Phys,2004; 89-140 (650).
    [24]卜月华.图论及其应用[M].南京：南京大学出版社,2002：136-231,157-159.
    [25]G. Frivolt, M. Bielikova. Topology Generation for Web Communities Modeling [C]. Proceedings of the 31 st Conference on Current Trends in Theory and Practice of Computer Science, Springer-Verlag,2005:167-177.
    [26]A. L. Barabasi, R. Albert. Emergence of scaling in random network[J].Science,1999 Oct 15,286(5439):509-512.
    [27]何玉宝,刘正捷,田晓杰.网站拓扑结构提取技术的研究与应用[J].计算机工程,2005,32(1)：157-159.
    [28]Jiawei Han, Micheline Kamber著.范明,孟小峰译.数据挖掘：概念与技术(原书第二版)[M].北京：械工业出版社,2007：351-382.
    [29]Jitesh Shetty, Jafar Adibi, The Enron Email Dataset Database Schema and Brief Statistical Report[R]. Technical report, Information Sciences Institute.2004.
    [30]Rosvall M, Bergstrom C T. An information-theoretic framework for resolving community structure in complex networks[J]. PNAS,2007,104(18):7327-7331.
    [31]GUO Chong-hui, ZHANG Liang. An Analysis Method Based on PCA for the Community Structure in Complex Networks[J]. Operations Research and Management Science,2008, 17(6):144-149.
    [32]Pang-Ning Tan,Steinbach,Vipin Kumar. Introduction to Data Mining[M].Beijing. People's Posts and Telecom Press,2008:47-61,460-484.
    [33]Breese, J. S., Heckerman, D.,& Kadie, C. (1998). Empirical analysis of predictive algorithms for collaborative filtering[C]. In Proceedings of the fourteenth annual conference on uncertainty in artificial intelligence (pp.3-52).
    [34]Resnick, P., lacovou, N., Suchak, M., Bergstrom, P.,& Riedl, J. (1994). Grouplens:an open architecture for collaborative filtering of netnews[C]. In Proceedings of the 1994 ACM conference on computer supported cooperative work (pp.175-186).
    [35]T. H. Haveliwala. Topic-sensitive pagerank:a context-sensitive ranking algorithm for Web search[C]. IEEE Transactions on Knowledge and Data Engineering,2003,15(4):784-796.
    [36]J. M. Kleinberg. Authoritative sources in a hyperlinked environment[C]. Proceedings of the Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, Philadelphia:Society for Industrial and Applied Mathematics,1998:668-677.
    [37]王晓宇,周傲英.万维网的链接结构分析及其应用综述[J].软件学报,2003,14(10)：1768-1779.
    [38]秦拯,张玲,李娜.改进的PageRank在Web信息搜索引擎中的应用[J].计算机研究与发展,2006,43(6)：1044-1049.
    [39]S. Brin, L. Page. The anatomy of a large-scale hypertextual Web search engine[C]. Proceedings of 7th World Wide Web Conference, Elsevier Science, Amsterdam, 1998:107-117.
    [40]J. M. Kleinberg. Authoritative sources in a hyperlinked environmen[J]. Journal of the ACM, 1999,46(5):604-632.
    [41]Brian Amento, Loren Terveen, Will Hill. Does "Authority" Mean Quality? Predicting Expert Quality Ratings of WEB Documents[C]. In:23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval,2000:140-145.
    [42]P. K. Reddy, M. Kitsuregawa. An approach to relate the Web communities through bipartite graphs[C]. The 2nd International Conference on Web Information Systems Engineering, IEEE,2001:301-310.
    [43]M. Fiedler. Algebraic connectivity of graphs[J].Czech Math J.1973,23:298-305.
    [44]B.W.Kernighan, S.Lin. An efficient heuristic procedure for partitioning graphs[J]. Bell System Technical Journal,1970,49:291-307.
    [45]Tyler J, Wilkinson D, Huberman B. Email as spectroscopy:Automated discovery of community structure within organizations[C]. International Conference on Communities and Technologies,2003,81-96.
    [46]M.Girvan and M.E.J.Newman. Community structure in social and biological networks, In: Proceedings of the National Academy of Sciences of the United States of America[C]. USA, 2002,99(12):69:66-133.
    [47]J. Rupnik,Finding community structure in social network analysis-overview[R].Department of Knowledge Technologies, Jozef Stefan Institute, Tech. Rep.,2006.
    [48]Zhenya Zhang, Jin Wang, Hongmei Chen, Xufa Wang. An Approach for Spatial Index of Text Information Based on Cosine Similarity[J],Computer Science.2005,32(9):160-163.
    [49]Heng Luo, Changyong Niu, Ruimin Shen, Carsten Ullrich. A collaborative filtering framework based on both local user similarity and global user similarity. Machine Learning[J].2008,72:231-245.
    [50]Yingzi Shang, Runqiu An. Maximum Likelihood Estimation of the Parameters of the Extreme Value Distribution and Computer Implementation[J]. Journal of Hebei Normal University (Natural Science Edition).2006,30(6):643-646.

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700