面向图的群体多特征提取与修正技术研究
详细信息    本馆镜像全文|  推荐本文 |  |   获取CNKI官网全文
摘要
近年来,复杂网络理论的发展为人类了解各种类型的真实网络提供了理论模型和研究方法。电信行业每天都产生海量的电信数据,电信通信数据已经成为复杂网络研究的主要载体之一。了解网络的群体特征有助于人们更深入地认识网络中群体的结构和特点,而特征修正是保障正确群体特征描述的必要步骤。因此,对复杂网络的特征提取和修正是当今一个非常有前景并且具有挑战性的研究领域。与此同时,研究人员面临的另一个挑战是如何在超大规模网络中进行数据挖掘,工业界和学术界已经使用分布式计算模型,如MapReduce和BSP等,取得了一些有效的成果。
     本文基于大规模电信通信数据,分别从拓扑结构,性别和年龄三个维度深入研究了电信群体的多种特征,并给出了特征提取的并行算法。比较多个关系分类器在电信网络上的效果,利用电信用户的属性信息改进了传统联合推断算法的预测效果,使得准确率大幅提升,并给出了联合推断的并行算法。本文主要工作如下。
     结合目前研究现状,在介绍了不同类型的群体特征的主要内容和研究成果之后,给出了网络群体划分方法,提出并建立了由模块度、节点度分布、聚集系数、平均最短路径组成的网络群体特征体系。提出了多种群体特征提取方法的并行实现,并针对不同的群体特征采用不同的并行计算模型。
     提出了以节点为中心的特征修正框架,给出了4种不同的关系分类器和3种不同的联合推断算法。综合分析了各个算法的特点,并给出了适合于并行化的松弛标记联合推断算法的MapReduce并行化版本,用于对大规模电信数据的联合推断。
     在电信通信数据集上对电信用户的拓扑特征和属性特征进行了分析研究,如邻居、年龄、性别、通话短信次数、通话时长等,从静态和动态两方面对人类通话和短信行为进行了刻画。并分析了电信用户通信的同质性,即用户更倾向于和自已相似的用户产生通信行为,电信运营商可基于此对目标客户进行精准分类与定位,从而进行精准营销。
     在分析了不同关系分类器在电信数据集上的效果之后,选取了准确率最高的邻居加权关系分类器。不同于传统的联合推断,本文不仅利用电信网络的拓扑信息,还利用了不同性别、年龄用户的通话特征,从而深刻揭示了电信用户交往行为的模式和内在特征。本文将松弛标记联合推断算法和决策树规则相结合,改进后的联合推断算法预测用户性别的准确率为93.17%,预测用户年龄的准确率为90.13%。
Recent studies on complex network provide theoretical model and research method for researchers to understanding real-word networks. Understanding the community feature is helpful to understand the network topology and group characteristics better, while feature prediction is a necessary step to ensure the correctness of feature extraction. Therefore, feature extraction and prediction in complex network is a challenging and prospective area. At the same time, the continued exponential growth in both the volume and the complexity of information is giving birth to a new challenge to the researchers. With respect to this challenge, multiple parallel computing platforms, such as MapReduce and BSP, has been emerging.
     Research in this paper are based on the massive telecom data, we present a comprehensive multidimensional study of telecom group feature from topology, gender, age three aspects and provide parallel algorithms for this feature extraction. After compare several relational classifiers, we use the communicaton characteristics of mobile phone users to increase the precison greatly and provide parallel algorithm for feature prediction. The tasks are as follows.
     Based on current research, after introducing main content and research result of different sorts of group characteristics, the community detection methods are provided, and the system of the network community features is proposed, consist of modularity, distribution of node degree, clustering coefficient and average shortest path. We propose parallel algorithms for all the community features mentioned above, using MapReduce or BSP parallel computing model according to different conditions.
     In this paper, we present a node-centric Network learning framework in which classifiers comprise a local classifier, a relational classifier, and a collective inference procedure. After introduce four relational_classifiers three collective inference algorithms,that relaxation labeling is suitable for parallelization and the MapReduce parallel algorithm is presented.
     We study the communication behaviors based on the topology of telecom network and attributes of mobile phone users, including gender, age, calling and short message informations to find the hidden behavior patterns of the daily interaction of human beings. We find that people tend to communicate more with each other when they have high similarity. The telecom service provider can target customers and percise marketing based on this analysis.
     We choose weighted-voted relational neighbor classifier (WVRN), with highest predicton precison, to predict features in telecom network. Besides the topology information, we also use the communication features of mobile phone users in the relational model. We combine the WVRN with a communication decision tree, achieving93.17%precison in gender prediction and90.13%precison in age prediction.
引文
1. D.J. Watts and S.H. Strogatz. Collective dynamics of'small-world'networks. Nature.393(6684):409-410,1998.
    2. A.-L. Barabasi and R. Albert. Emergence of Scaling in Random Networks. Science.286(5439):509-512,1999.
    3. M. E. J. Newman. The structure and function of complex networks. SIAM. 45(2):167-256,2003.
    4. Big-Cloud, http://labs.chinamobile.com/
    5. S. Seo, E. J. Yoon, J. Kim, et al. HAMA:An Efficient Matrix Computation with the MapReduce Framework.2010 IEEE Second International Conference on CloudCom,721-726.2010.
    6. J. Dean, S. Ghemawat. Mapreduce:Simplified data processing on large clusters. OSDI'04,137-150.2004.
    7. Hadoop, http://hadoop.apache.org/
    8. Radicchi, F., et al. Defining and identifying communities in networks. Proceedings of the National Academy of Sciences,2004.101(9):p.2658-2663.
    9. M. Faloutsos, P. Faloutsos, C. Faloutsos. On Power-Law Relationships of the Internet Topoloty.
    10. Lee, S.H., P.-J. Kim, and H. Jeong, Statistical properties of sampled networks. Physical Review E,2006.73:p.016102.
    11. Girvan, M. and M. Newman, Community structure in social and biological networks. Proceedings of the National Academy of Sciences,2002.99(12):p. 7821-7826.
    12. M. E. J. Newman, and M. Girvan, Finding and evaluating community structure in networks. Physical Review E,2004.69:p.026113.
    13. M. E. J. Newman. Analysis of weighted networks [J]. PhysRev E,2004, 70:056131.
    14. A. Arenas, J. Duch, A. Fernandez, et al. Community structure in directed networks [J]. New J Phys,2007,9:176.
    15. Newman, M.E.J., Fast algorithm for detecting community structure in networks. Physical Review E,2004.69:p.066133.
    16. Clauset, A., M.E.J. Newman, and C. Moore, Finding community structure in very large networks. Physical Review E,2004.70(6):p.066111.
    17. M. Rosvall and C.T. Bergstrom. An information theoretic framework for resolving community structure in complex networks [J]. P Natl Acad Sci USA, 2007,104(18):7327-7331.
    18. Blonde], V.D., et al., Fast unfolding of communities in large networks. J. Stat. Mech.,2008:p.10008.
    19. Raghavan, U.N., R. Albert, and S. Kumara, Near linear time algorithm to detect community structures in large-scale networks. Physical Review E,2007.76(3): p.036106.
    20. Palla, G., et al., Uncovering the overlapping community structure of complex networks in nature and society. Nature,2005.435:p.814-817.
    21. Kumpula, J.M., et al., Sequential algorithm for fast clique percolation. Phys. Rev. E,2008.78(2):p.026109.
    22. Ahn, Y.-Y, J.P. Bagrow, and S. Lehmann, Link communities reveal multiscale complexity in networks. Nature,2010.466(7307):p.761-764.
    23. S.H. Zhang, X.M. Ning, X.S. Zhang. Identification of functional modules in a PP1 network by clique percolation clustering [J]. Comput Biol Chem,2006, 30(6):445-451.
    24. Evans, T.S. and R. Lambiotte, Line graphs, link partitions, and overlapping communities. Phys. Rev. E,2009.80(1):p.016105.
    25. 徐峰.互联网宏观拓扑结构中社团特征演化分析及应用[D],沈阳:东北大学,2008.
    26. D. Jensen and J. Neville. Linkage and autocorrelation cause feature selection bias in relational learning. In Proceedings of the Nineteenth International Conference on Machine Learning(ICML),2002b.259-266.
    27. E. Ising. Bertrag zur Theorie des Ferromagnetismus. Zeitschrift f. Physik,1925. 31:253-258
    28. R.B. Potts. Some generalized order-disorder transformations. Cambridge Philosophic Society,1952.48:106-109.
    29. S.Geman and D. Geman. Stochastic relaxation, gibbs distributions and the bayesian restroation of images. IEEE Transactions on Pattern Analysis and Machine Intelligence(PAMI),1984.6:721-741.
    30. E. Segal, H. Wang and D. Koller. Discovering molecular pathways from protein interaction and gene expression data. Bioinformatics,2003a.19:1264-1272.
    31. E. Segal, R. Yelensky, and D. Koller. Genome-wide discovery of transcriptional modules from DNA sequence and gene expression. Bioinformatics,2003b. 19:1273-1282.
    32. X. Zhu, Z. Ghahramani, and J. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In Proceedings of the Twentieth International Conference on Machine Learning(ICML),2003.912-919.
    33. A. Popescul and L. H. Ungar. Statistical relational learning for link prediction. In Proceedings of the learning Statistical Models from Relational Data Workshop at the Nineteenth International Joint Conference on Artificial Intelligence(IJCAI),2003.
    34. J. O'Madadhain, J. Hutchins, P. Smyth, Prediction and ranking algorithms for event-based network data, in:Proceedings of SIGKDD 2005, ACM Press, New York,2005, p.23.
    35. Linyuan Lv. Link prediction on complex networks. Journal of University of Electronic Science and Technology of China.5(39),2010.
    36. S. Geisser, Predictive Inference:An Introduction, Chapman and Hall, New York, 1993.
    37. J.L. Herlocker, J.A. Konstann, K. Terveen, J.T. Riedl, Evaluating collaborative filtering recommender systems, ACM Trans. Inf. Syst.22 (2004) 5.
    38. S. A. Macskassy and F. Provost. A simple relational classifier. In Proceedings of the Multi-Relational Data Mining Workshop(MRDM) at the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2003.
    39. C. Perlich and F. Provost. Aggregation-based feature invention and relational concept classes. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,2003.167-176.
    40. C. Perlich and F. Provost. Distribution-based aggregation for relational learning with identifier attributes. Machine Learning,2006.62(1/2):65-105.
    41. S. Chakrabarti, B. Dom, and P. Indyk. Enhanced hypertext categorization using hyperlink. In Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data,1998.307-319.
    42. Q. Lu and L. Getoor. Link-based classification. In Proceedings of the Twentieth International Conference on Machine Learning(ICML),2003.496-503.
    43. S. A. Macskassy. Classification in Networked Data:A Toolkit and a Univariate Case Study. Journal of Machine Learning Research,2007.8:935-983.
    44. S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence(PAMI),1984.6:721-741.
    45. J.-P. Onnela, J. Saramaki, J. Hyvonen, et al. Analysis of a large-scale weighted network of one-to-one human communication. New Journal of Physics,2007. 9:179.
    46. J. Leskovec, E. Horvitz. Planetary-scale views on a large instant-messaging network. In WWW'08:Proc.17th Int. World Wide Web Conference,2008.915-924.
    47. M. Gonzalez, C. Hidalgo, and A.-L. Barabasi. Understanding individual human mobility patterns. Nature,2008.453(7196):779-782.
    48. F. Giannotti, M. Nanni, and D. Pedreschi. Efficient mining of temporally annotated sequences. In SDM,2006.
    49. F. Giannotti, M. Nanni, F. Pinelli and D. Pedreschi. Trajectory pattern mining. In KDD,2007.330-339.
    50. Y. X. Dong, Q. Ke, J. Rao, B. Wang, B. Wu. Random Walk based Resource Allocation:Predicting and Recommending Links in Cross-Operator Mobile Communication Networks. In ICDM-COMMPER'2011:ICDM Workshop on Mining Communities and People Recommenders,2011.
    51. 吕琳媛.复杂网络链路预测.电子科技大学学报,2010.39(5).
    52. S.-H. Park, H.-J. Lee, S.-P. Han, D.-H. Lee. User Age Profile Assessment Using SMS Network Neighbors' Age Profiles. Proceedings of the 2009 International Conference on Advanced Information Networking and Applications Workshops, 2009.960-965.