Research on a Multi-Relation Social Network Analysis and Visualization System

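English title: The Research of Multi-Relation Social Network Visual Analytic System
Author: Lijun Suo (索利军)
Degree level: Master's
Discipline: Computer Science and Technology
Chinese keywords: 多关系社会网络 (multi-relation social network) ; 实体解析 (entity resolution) ; 社群发现 (community detection) ; 可视分析 (visual analytics)
English keywords: multi-relation social network ; entity resolution ; community detection ; visual analytics
Degree year: 2010
Advisor: Bin Wu (吴斌)
Discipline code: 081202
Degree-granting institution: Beijing University of Posts and Telecommunications (北京邮电大学)
Thesis submission date: 2010-01-09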

Abstract

Traditional data mining techniques (classification, clustering, association analysis, and so on) concentrate on the attributes stored in dimension tables and ignore the relationships that exist between records. Mainstream network analysis methods, conversely, concentrate on network topology and overlook the attributes of the nodes themselves. The multi-relation social network (MRSN) proposed in this thesis builds a heterogeneous network model that retains as much of the information in the raw data as possible, and it serves as the basis for the studies listed below.
     This thesis examines multi-relation social networks from the following aspects:
     (1) Multi-relation network modeling and network extraction. After real-world data has been modeled as a multi-relation network, we define extraction operators that derive single-relation (homogeneous) networks with a specific meaning from the multi-relation network.
     (2) Entity resolution in MRSN. Data gathered from multiple sources must be integrated and pre-processed before accurate knowledge discovery models can use it. Merging several sources into a single data set typically produces many duplicate records: records that are not syntactically identical yet represent the same real-world entity. Merging these duplicates correctly is an essential step in producing high-quality data, and the process is known as entity resolution. Building on attribute matching, this thesis attempts to improve the accuracy of entity resolution by adding link analysis over the multiple relations in an MRSN.
     (3) Community detection on networks whose nodes carry attributes. Community detection is an important tool for studying complex networks, but current algorithms rely mainly on network topology and neglect the content of the nodes. We propose an algorithm called CDNA that uses node attributes in addition to topology to further improve the accuracy of community detection.
     (4) Visualization. Visualization, meaning software systems that provide statistical or interactive visual representations to help analysts explore and interpret data, is a very important step in the data mining process. This thesis also studies the visualization of multi-relation social networks: it designs different views for different types of networks, introduces the concept of "network browsing", and applies that concept within a framework for browsing large-scale networks.
     (5) Application. The research above was applied in the National Science and Technology Support Program (国家科技支撑计划) project "Science and Technology Literature Information Service System: Key Technology Research and Application Demonstration", for which we developed a scientific literature visual analytics system called LiterMiner. The tool demonstrates the feasibility of the research above.
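
Items (1) and (2) can be illustrated with a small sketch. The abstract does not specify the extraction operators or the matching formula, so everything below is an assumption made for illustration: the toy edge list, the relation labels `writes` and `cites`, the character-set name similarity (a crude stand-in for real attribute matching), and the weight `alpha` and `threshold` parameters are all hypothetical. The sketch first extracts a homogeneous co-author network from a typed edge list, then scores candidate duplicates by combining attribute similarity with the overlap of their co-author sets.

```python
from itertools import combinations

# Hypothetical toy data: a multi-relation network stored as typed edges
# (relation, source, target). All names and labels are illustrative only.
EDGES = [
    ("writes", "Smith J.", "p1"),
    ("writes", "John Smith", "p2"),
    ("writes", "Lee A.", "p1"),
    ("writes", "Lee A.", "p2"),
    ("writes", "Chen M.", "p3"),
    ("writes", "John Smith", "p3"),
    ("cites", "p2", "p1"),
]


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def extract_coauthor_network(edges):
    """Extract a single-relation network: two authors are linked if they
    share at least one paper. Authors without co-authors are omitted."""
    authors_by_paper = {}
    for relation, source, target in edges:
        if relation == "writes":
            authors_by_paper.setdefault(target, set()).add(source)
    coauthors = {}
    for authors in authors_by_paper.values():
        for a, b in combinations(sorted(authors), 2):
            coauthors.setdefault(a, set()).add(b)
            coauthors.setdefault(b, set()).add(a)
    return coauthors


def name_similarity(a, b):
    """Assumed attribute match: Jaccard over the letters used in each name."""
    def letters(name):
        return {c for c in name.lower() if c.isalpha()}
    return jaccard(letters(a), letters(b))


def resolve_candidates(edges, alpha=0.5, threshold=0.5):
    """Return (a, b, score) for author pairs whose combined score reaches
    the threshold. alpha weights attribute evidence against link evidence."""
    coauthors = extract_coauthor_network(edges)
    matches = []
    for a, b in combinations(sorted(coauthors), 2):
        attribute_score = name_similarity(a, b)
        relation_score = jaccard(coauthors[a], coauthors[b])
        score = alpha * attribute_score + (1 - alpha) * relation_score
        if score >= threshold:
            matches.append((a, b, score))
    return matches


print(extract_coauthor_network(EDGES))
print(resolve_candidates(EDGES))
```

In this toy data, "Smith J." and "John Smith" have similar names and share a co-author, so the two kinds of evidence reinforce each other. Setting `alpha=0.0` (links only) would additionally pair two authors who merely share a co-author, which suggests why link evidence is better used on top of attribute matching than in place of it.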
