Web社区发现算法的研究

英文题名：A Syudy of Web Community Detection Algorithm
作者：黄伟平
论文级别：硕士
学科专业名称：电子与通信工程（专业学位）
中文关键词：复杂网络 ; 社区结构 ; 社区吸引力 ; 社区发现 ; 重叠社区
英文关键词：Complex Networks ; Community Structure ;
英文关键词：Community Attractive ; Community Detection ;
英文关键词：Overlapping Community
学位年度：2013
导师：刘瑞芳
学科代码：0852
学位授予单位：北京邮电大学
论文提交日期：2013-01-05

摘要

社区发现技术是网络研究中一项非常重要的技术,因为社区结构在定程度上反映了真实系统的拓扑关系,所以,识别出网络图中的潜在社区具有非常重要的研究意义。特别需要指出的,随着互联网的飞速发展,其已经成为了人类社会生活中不可缺少的一部分,发现并分析存在于互连网中的社区则具有史深刻的意义。我们知道,现实生活中的各种社交圈子可以在某些层面上,反映出人与人之间的关系。而同一社交圈子的人通常在网络上也会存在一些联系,因此对诸如FaceBook、 Twitter、新浪微博、天涯、猫扑、豆瓣网、人人网等在线社交网络进行社区结构分析,进而发现人们之问存在的各种潜在关系,这样不仅可以快速的了解和预测人们的活动,而且还可以更加有效的监测社会舆论的发展情况,同时也可以在对比现实社区和虚拟网络社区中人们活动的异同时给我们提供依据。另外,现如今基本每一个在线商城都有商品推推荐功能,而这个功能的实现其实也是基于社区发现技术的。商品推荐系统的关键其实就是要把那些具有相似购买兴趣的顾客从庞大的顾客和商品购买关系网络中识别出来,这个过程其实就是一个社区发现的过程。再者,互联网中充斥着各种各样的信息,想要快速的从这些海量数据中,提取出用户需要的信息是很不容易的。如果我们对互联网中存在的这些信息进行社区发现的话,不仅可以把这个问题解决,还可以实现针对个人的信息推荐,以及网络中的智能搜索功能,从而带领用户,让他们更加准确和快速的找到自己感兴趣的信息。
     本文首先对一些和社区发现技术有关的理论知识进行了介绍,之后又对一些早期的比较经典的社区发现算法,如传统图分割算法Kernighan-Lin算法、层次聚类方法中的GN算法、基于模块度优化的Newman快速算法(其也属于凝聚算法)、谱分析算法普分法、以及Palla等人提出的用于重叠社区发现的CPM算法等进行了介绍,同时还对它们的优势及不足,算法的复杂度度及适用范围进行了分析。此外,还对社区质量客观评价标准进行了介绍。
     本文在深入分析并理解现有社区发现算法的基础上,结合微博这种双向性社交网络的特性,提出了两个针对微博的社区发现新算法,分别是：1)基于社区吸引力的社区发现算法和2)基于社区吸引力的重叠社区发现新算法。其中,算法1)的提出,是为了解决现有社区发现算法在面对微博这种大规模网络时复杂度过高而难以应用的问题。算法1)不仅在时间复杂度方面优于现有的社区发现算法,同时在社区发现精准度方面也有不俗的表现。而算法2)的提出,是因为我们考虑到在现实世界中一个用户可能会同时属于多个社区,而现有的大多数算法以及我们所提出来的算法1)都是简单把每个用户划分到一个单独的社区中,而这与事实有点不太相符；同时,现有的一些重叠社区发现算法在面对大规模网络时性能不佳,算法2)解决这些问题。最后,我们还给出了这两个算法的实验结果,这些实验结果对我们所提出的两个算法在社区发现的有效性和高效性上均给出了有力的证明。
Community Detection is a very important technology in network research. Because, in some extent, Community structure could reflect the topology relationship of real system, and there is great significance to find the potential communities in network. Particularly, with the rapid development of the Internet, it has become an indispensable part of human's social life, so discovery and analyze the potential communities in the Internet has a further significance. We know that, in reality, the variety of social circles, in some levels, could reflect the relationship among people. The peoples in the same social circle usually also have some contacts in the Internet, Doing community structure analysis of the online social networks such as FaceBook, Twitter, Sina, Tianya, Mop could find the potential relationship exists among people, then we not only can quickly understand and predict people's activities, but also can carry out a more effective monitoring of the development of public opinion. At the same time, it can provide some basis when we compare the activities of people between reality communities and virtual network community. On the other hand, nowadays, commodity recommended function is a function which most online malls provided, the realization of this function is actually based on community detection technology. The key of commodity recommended system is identifying the customers with similar buying interest from the relationship network of customers and commodities, this process is actually the process to do community detection among customers. Moreover, there are variety of information filled with the Internet, so getting the information we wanted is not an easy thing. If we do community detection among these information of the Internet, then, not only can this problem be solved and also can achieve the personal information recommended function, as well as the intelligent search function, which lead the user to find the information they interested more accurately and quickly.
     In this paper, we first introduced the theoretical knowledge of community detection, and then give a brief review of some traditional algorithms, such as Kernighan-Lin algorithm(a segmentation algorithm), GN algorithm(a hierarchical clustering algorithm), Newman-fast algorithm(a algorithm based on module optimization), CPM algorithm(a algorithm to find overlapping communities). To each algorithm, we also analyzed its advantages and disadvantages, as well as its complexity. In addition, we also introduced the objective quality evaluation criteria of community.
     After making deep analysis and understanding of the existing community detection algorithms, we proposed two new community detection algorithms. The two new algorithms are both designed for micro-blog, which is a bidirectional social network. The first algorithm is community detection algorithm based on the community attractive; and the second one is overlapping community detection algorithm based on community attractive. The reason why algorithm one is proposed, is to solve the problem that the existing community detection algorithms are of high complexity when used to analyze the large-scale network like micro-blog. Algorithm one not only performs better than the existing community detection algorithms in time complexity, also performs well in accuracy. In real world, a people may belong to multiple communities at the same time, but most existing algorithms and algorithm one just simply divide each individual into a single community, so the results of these algorithms are not quite consistent with the fact, that is why algorithm two is proposed; the existing overlapping community detection algorithms are of poor performance for large-scale network is also the reason. Finally, the simulation results give a strong proof to the effectiveness and efficiency of the two algorithms.

引文

[1]R. Albert and A.-L Barabasi. Statistical Mechanics of Complex Networks. Rev. Modern. Phys.,2002.74, P.47-97.
    [2]M.E.J. Newman, A.-L. Barabasi and D. Watts. The Structure and Dynamics of Networks. Princeton University Press,2006.
    [3]汪小帆,李翔,陈关荣.复杂网络理论及其应用[M].北京：清华大学出版社,2005.
    [4]V. D. Blondel, J.-L. Guillaume, R. Lambiotte and E. Lefebvre. Fast unfolding of communities in large networks. J. Stat. Mech.,2008, P.10008.
    [5]S. Fortunate. Community detection in graphs. Physics Reports,2010.486(3-5), P.75-174.
    [6]S. Fortunate and C. Castellano. Community Structure in Graphs.2007 arXiv:0712.2716
    [7]Weiss, R. S., and E. Jacobson. A method for the analysis of the structure of complex-organizations. American Sociological Review,1955.29, P.661
    [8]S.A. Rice. The identification of blocks in small political bodies. The American Political Science Review,1927.21, P.619.
    [9]B.W. Kernighan and S. Lin. An efficient heuristic procedure for partitioning graphs. Bell System Technical Journal,1970.49(2), P.291-307.
    [10]ER Barnes. An algorithm for partitioning the nodes of a graph. S1AM Journal on Algebraic and Discrete Methods,1982
    [11]M.E.J. Newman Fast algorithm for detecting community structure in networks. Physical Review E,2004.69(6), P.066133.
    [12]A. Clauset, M.E.J. Newman and C. Moore. Finding community structure in very large networks. Physical Review E,2004.70(6), P.066111.
    [13]M. Girvan and M.E.J. Newman. Community structure in social and biological networks. Proceedings of the National Academy of Sciences of the United States of America, 2002.99(12), P.7821.
    [14]M. Rosvall and C.T. Bergstrom. Maps of random walks on complex networks reveal community structure. The National Academy of Sciences of the USA,2008.105(4), P.1118-1123
    [15]V.D. Blonde, J.-L. Guillaume,R. Lambiotte and E. Lefebvre. Fast unfolding of communities in large Networks. Journal of Statistical Mechanics,2008, P10008
    [16]P. Ronhovde and Z. Nussinov. Multiresolution community detection for megascale networks by information-based replica correlations. PHYSICAL REVIEW E,2009.80, P.016109.
    [17]G. Palla, I. Derenyi, I. Farkas and T. Vicsek. Uncovering the overlapping community structure of complex networks in nature and society. Nature,2005.435, P.814.
    [18]J. M. Kumpula, M. Kivela, K. Kaski and J. Saramaki. Sequential algorithm for fast clique percolation. Phys. Rev. E 2008.78(2), P026109.
    [19]J. Hopcroft, O. Khan, B. Kulis and B. Selman. Tracking evolving communities in large linked networks. Proc.Natl. Acad. Sci. USA,2004.101,P5249.
    [20]G. Palla, A.-L. Barabasi and T. Vicsek. Quantifying social group evolution. Nature,2007.446, P.664
    [21]S. Mancoridis, B. S. Mitchell, C. Rorres, Y. Chen and E. R..Using Automatic Clustering to Produce High-Level Organization of Source Code. Proceedings of the 6th International Workshop on Program Comprehension (IEEEComputer Society, Washington, DC, USA).1998
    [22]F. Radicchi, C. Castellano, F. Cecconi, V. Loreto, D. Parisi. Defining and identifying communities in networks. Proceedings of the National Academy of Sciences of the United States of America,2004.101(9), P.2658-2663.
    [23]M E J. Newman and M. Girvan. Finding and evaluating community structure in networks. Phys.Rev.E,2004.69, P.026113.
    [24]姚灿中.产业复杂网络的建模,仿真与分析.2010,华南理工大学.
    [25]L. C. Freeman. A set of measures of centrality based on betweenness. Sociometry 1977.40, P.35.
    [26]W. Donath, and A. Hoffman. Lower bounds for the partitioning of graphs. IBM Journal of Research and Development 1973.17(5), P.420.
    [27]M. Fiedler. Algebraic connectivity of graphs. Czechoslovak Mathematical Journal, 1973.23(2), P.298-305.
    [28]A. Pothen, H.D.Simon and K.P. Liou. Partitioning sparse matrices with eigenvectors of graphs. SIAM J.MATRIX ANAL.APPLIC,1990.11(3), P.430-452.
    [29]C. Lanczos. An iteration method for the solution of the eigenvalue problem of linear differential and integral operators. J.Res.Nat.Bur.Standards,1950.45(4), P.255-282.
    [30]M. E. J. Newman and E. A. Leicht. Mixture models and exploratory analysis in networks. Proc. Natl. Acad.Sci. USA.2007.104, P.9564.
    [31]K. Nowicki and T. A. B. Snijders. Estimation and prediction for stochastic block structures. J. Am. Stat. Assoc.2001.96(455).
    [32]Snijders, T. A. B., and K. Nowicki, Estimation and prediction for stochastic blockmodels for graphs with latent block structure. J. Classif.1997.14, P.75-100.
    [33]M.E.J. Newman. Analysis of Weighted Networks. Physical Review E., 2004.70,P.056131.
    [34]A. Arenas, J. Duch, A. Fernandez and S. Gomez. Size reduction of complex networks preserving modularity. New J. Phys.,2007.9, P.176
    [35]E. A. Leicht and M. E. J. Newman. Community Structure in Directed Networks. Phys. Rev. Lett.,2008.100, P.118703
    [36]H. Shen, X. Cheng, K. Cai and M.-B. Hu. Detect overlapping and hierarchical community structure in networks. Physica A:Statistical Mechanics and its Applicatio ns,2009.388(8),P.1706-1712
    [37]R. Guimera, M. Sales-Pardo and L. A. N. Amaral. Modularity from fluctuations in random graphs and complex networks. Phys. Rev. E,2004.70(2), P.025101 (R).
    [38]J. Reichardt and S. Bornholdt. Statistical mechanics of community detection. Phys. Rev. E,2006.74(1), P.016110
    [39]S. Muff, F. Rao and A. Caflisch. Local modularity measure for network clusterizations. Phys. Rev. E,2005.72, P.056107
    [40]M.E.J. Newman Scientific collaboration networks. Phys. Rev. E,2001.64, P.016131
    [41]http://snap.stanford.edu/
    [42]http://mbostock.github.com/protovis/

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700