基于属性扩展图的K-means聚类算法的研究

英文题名：Research on the K-means Clustering Algorithm Based on Attribute Augumented Graph
作者：任巍英
论文级别：硕士
学科专业名称：计算机软件与理论
中文关键词：K-means聚类算法 ; 属性扩展图 ; 邻接随机行走距离
英文关键词：K-means Clustering Algorithm ; The Attribute Augument Graph ; Neighborhood Random Walk Distanc
学位年度：2012
导师：高媛
学科代码：081202
学位授予单位：中北大学
论文提交日期：2012-04-24
答辩委员会主席：李德玉

摘要

社团结构是社会网络普遍存在的拓扑特性之一，发现社会网络中的社团结构是复杂网络研究的基础性问题。聚类算法是发现社团结构的一种重要的方法。聚类分析技术在过去的许多年中得到了广泛的研究，其中K-means聚类算法是众多聚类算法中比较经典的一个。K-means聚类算法由于思想简单、时间复杂度小而被广泛的进行了研究与运用，尤其在对大规模数据进行挖掘中，K-means聚类算法具有高效性及可伸缩性。
     真实网络中，除了数据之间存在的拓扑结构以外，其数据本身存在着各种特殊属性。现存的许多聚类算法仅仅依靠数据间的拓扑结构进行聚类，而很大程度的忽略了数据所具有的特有属性在聚类分析中的作用。本文在分析聚类算法中节点的拓扑结构及特有属性的作用之后，对K-means聚类算法进行改进，提出了一种新的聚类算法-SAK聚类算法。本文的主要研究成果如下：
     （1）将真实网络用图的模型表示，并根据现实网络的实用性，将节点的属性特性作为节点添加到图中，并根据节点与属性的关系添加相应的边，从而构成属性扩展图。在属性扩展图的基础上，使用随机行走模型对节点的结构及属性相似性进行统一的测量。
     （2）提出了一种自动更新权重值的方法，在聚类算法不断迭代的过程中，节点边的权重会随之发生变化，随着权重的改变节点间的相似度也会随之改变，这样，不同的属性将会在聚类算法中起到不同的作用。这种改变将会使节点间相似度的测量更加趋于实际，趋于准确。
     （3）提出基于属性扩展图的K-means聚类算法（SAK），该算法改变K-means算法随机选取初始聚类中心的方法，采用密度函数的方法进行初始聚类中心的选取。并运用两个真实的社会网络对本文所提出的SAK聚类算法进行了验证。
Community structure is one of the common topological characteristics of social networks.Community detection has become a fundamental problem in the research field of socialnetworks. The clustering algorithm is an important way to find community structure. Theclustering analysis technique has been widely studied in the past many years, one of theclustering algorithm is k-menas clustering algorithm, which is the most classical. Because ofsimple thinking and short time complexity , k-means is wide studied and applied. Especiallythe k-means algorithm is scalable and high efficient in large-scale database.
     In the real network, there is topological structure between datas, and there are specialproperties too. Many existing clustering methods mainly focus on the topological structure,but largely ignore the data properties, which are often important in the clustering algorithm.The topological structure and properties of data play an important role on the clusteringanalysis.In this paper, we analyze the important of topological structure and properties in theclustering algorithm, then impore the K-means clustering algorithm. We propose a newclustering algorithm (SAK). The main contributions of this paper are summarized below:
     (1) We use graph model to said real network. In accordance with the practicality of thereal network, we add the properties characteristic of the node as a node in the graph, and addthe edges in accordance with the relationship between the nodes and the attribute. Then theattribute augment graph is constituted. A random walk model is used to measure the vertexcloseness through the topological structure and properties based on the attribute augmentgraph.
     (2)We propose a weight self-adjustment method. In the iterative process of clusteringalgorithm, the weight of edges and the vertex closeness is updated too. The different attriburewill play different role in the clustering algorithm. The change will make the measurement ofsimilarity become more practical and accurate.
     (3) The k-means clustering algorithm based on the attribute augument graph is putforward. The clustering algorithm change the method of randomly selecting the initialclustering centers on the k-means clustering algorithm. The density function method is used toselect the initial clustering centers. We use two real network to verify the clustering algorithm.

引文

[1]孙鹏岗.聚类算法研究及其在网络模块性分析中的应用[D].西安:西安电子科技大学. 2011.
    [2] Santo Fortunato. Community detection in graphs[J]. Physics Reports. 2010, 486(3):170-174.
    [3] Raied SalmanVojislav KecmanQi Li, Robert StrackErick Test. Fast K-means AlgorithmClustering[J]. International Journal of Computer Networks & Communications. 2011, 3(4):654-660.
    [4] Raghuvira Pratap AK Suvarna VaniJ Rama DeviDr.K Nageswara Rao. An EfficientDensity based Improved K- Medoids Clustering algorithm[J]. International Journal ofAdvanced Computer Sciences and Applications. 2011, 2(6): 837-842.
    [5]郭春艳.基于连接度的图聚类算法研究[D].山西:山西大学. 2008.
    [6]袁冠,夏士雄,张磊,周勇.基于结构相似度的轨迹聚类算法[J].通信学报. 2011,32(9): 103-121.
    [7]孙吉贵,刘杰,赵连宇.聚类算法研究[J].软件学报. 2008. 19(1).48-61.
    [8] J. Sun, C. Faloutsos, S. Papadimitriou, and P. S. Yu.Graphscope: parameter-free mining oflarge time-evolvinggraphs[J]. Knowledge Discovery in Databases. 2007,6: 687-696.
    [9] Y. Sun, J. Han, P. Zhao, Z. Yin, H. Cheng, and T. Wu. Integrating clustering with rankingfor heterogenous information network analysis[J]. Conf. Extending Database Technology.2009, 8: 565-576.
    [10]朱娴,马卫.一种基于层次聚类的双聚类算法[J].微计算机应用. 2009,30(5):12-16.
    [11]尚海昆. K-means聚类算法的研究[D].保定,华北电力大学.2009.
    [12]杨建红.基于密度的聚类算法研究[D].长春:长春工业大学硕士学位论文. 2010.
    [13]杨草原,刘大有,杨博,池淑珍,金弟.聚类集成方法研究[J].计算机科学. 2011,38(2): 166-173.
    [14] Alexander Hinneburg,Daniel A.Keim.An Efficient Approaxh to Clustering in LargeMultimedia Databases with Noise[J]. Knowledge Discovery and Data Mining. 1998,8:58-65.
    [15]张科泽,杨鹤标,沈项军,蒋中秋.基于节点数据密度的分布式K-means聚类算法研究[J].计算机应用研究. 2011, 28(10): 3643-3648.
    [16]张雪风,张桂珍,刘鹏.基于聚类准则函数的改进K-means算法[J].计算机工程与应用. 2011, 47(11): 123-129.
    [17] Mohit Kumar K, K. Agrawal Deepak, Arora Reena Mishra. Implementation andBehavioural Analysis of Graph Clustering using Restricted Neighborhood SearchAlgorithm [J]. International Journal of Computer Applications. 2011, 22(6);231-241.
    [18] Gao Lin, Sun Peng Gang, Song Jia. Clustering algoritms for detecting functionalmodules in protein interaction networks[J]. Journal of Bioinformatics an ComputationalBiology. 2009,7(1),217-242.
    [19] Y. Tian, R. A. Hankins, and J. M. Patel. Ecient aggregation for graph summarization[J].Management of Data .2008 ,7: 567-580.
    [20]李仁侃,叶东毅.粗糙K-Modes聚类算法[J].计算机应用. 2011,31(1): 97-102.
    [21] Hae-Sang Park, Chi-Hyuck Jun. A simple and fast algorithm for K-medoids clustering[J].Department of Industrial and Management Engineering; 2008.98. 424-431.
    [22] Juanying Xie, Shuai Jiang, Weixin Xie, Xinbo Gao. An Efficient Global K-meansClustering Algorithm[J]. Journal of Computers, 2011, 2(6), 271-279.
    [23]贾宗维.基于节点结构互联性的图聚类算法研究[D].山西:山西大学硕士学位论文. 2009.
    [24] N. Suguna. Predicting Missing Attribute Values Using k-Means Clustering[J]. Journal ofComputer Science 2011, 7 (2): 216-224.
    [25]周世兵,徐振源,唐旭清.新的K-均值算法最佳聚类数确定方法[J].计算机工程与应用. 2010,46(16):27-31.
    [26] K. Prasanna, M. Sankara Prasanna Kumar. A Novel Benchmark K-Means Clustering onContinuous Data[J]. 2011,8(3):2974-2978
    [27] D. NapoleonS, Pavalakodi. A New Method for Dimensionality Reduction UsingK-Means Clustering Algorithm for High Dimensional Data Set[J]. International Journalof Computer Applications. 2011, 13(7):478-486.
    [28]白旭,勒志军. K-中心点聚类算法优化模型的仿真研究[J].计算机仿真. 2011, 28(1):218-223.
    [29] Juanying, XieShuai, JiangWeixin, XieXinbo Gao. An Efficient Global K-meansClustering Algorithm[J]. Journal of Computers. 2011, 6(2): 968-974.
    [30] Y. Sun, J. Han, P. Zhao, Z. Yin, H. Cheng, and T. Wu. Integrating clustering with rankingfor heterogenous information network analysis[J]. Extending Database Technology.2009,8: 565-576.
    [31]屈新怀,高万里,丁必荣,李朕.基于聚类数和初始值得K-means算法改进研究[J].组合机床与自动化加工技术. 2011, 4:42-48.
    [32]李亚飞.复杂网络中的社团结构检测算法研究[D].北京,北京交通大学,2011.
    [33] Yang Zhou,Hong Cheng ,Jeffrey Xu Yu. Graph Clustering Based on Structural/AttributeSimilarities[J]. Department of Systems Engineering and Engineering Management.2009,8.
    [34]李强,何衍,蒋静萍.一种基于随机游动的聚类算法[J].计算机工程与应用.2009,31,3:523-526.
    [35] P. Pons and M. Latapy. Computing communities in larger networks using randomwalks[J].Journal of Graph Algorithmsand Applications.2006, 10(2):191-218.
    [36] D.VanisriDr, C.Loganathan. An Efficient Fuzzy Clustering Algorithm Based on ModifiedK-Means[J]. International Journal of Engineering Science and Technology. 2011, 2(10):245-232.
    [37] PmichaelLaszlo,Psumitra Mukherjee. A Genetic algorithm that exchanges neighboringcenters K-means algorithm[J]. Pattern Recognition Letters. 2007. 28(16): 2359-2366.
    [38] S. Navlakha, R. Rastogi, and N. Shrivastava. Graph summarization with bounded error[J].Management of Data. 2008,6, 419-432.
    [39] D. Dueck, B. J. Frey. Non-metric affinity propagation for unsupervised imagecategorization[J]. IEEE International Conference on Computer Vision(ICCV). 2007.1-8.
    [40] Antonio Augusto Chaves , Luiz Antonio Nogueira Lorena. Clustering search algorithmfor the capacitated centered clustering problem[J]. Computers & operations research.2010, 37(3): 1245-1252.
    [41] Yaghub pirzadeh, Jamal shahrabi, Mohamad taghi taghavifard. Rapid Ant basedclustering-genetic algorithm (RAC-GA) with local search for clustering problem[J].International Journal of Industrial Engineering Computations. 2012, 3(3): 698-705.
    [42]孙鹏岗.聚类算法研究及其在网络模块性分析中的应用[D].西安:西安电子科技大学. 2011.
    [43]李晓佳,张鹏,耿增如,樊瑛.复杂网络中的社团结构[J].复杂系统与复杂性科学.2008.5(3).19-42
    [44]骆志刚,丁凡,蒋晓舟,石金龙.复杂网络社团发现算法研究新进展[J].国防科技大学学报. 2011, 33(1): 47-55.
    [45] T.Velmurugan, T.Santhanam. Clustering Mixed Data Points Using Fuzzy C-MeansClustering Algorithm for Performance Analysis[J]. 2011, 2(9): 254-262.
    [46] L. Duan et al. A Local Density Based Spatial Clustering Algorithm with Noise[J]. IEEEInternational Conferenceon Systems. 2006, 5(9): 8-11.
    [47] Yanfeng Zhang, Xutao Li, Yunming Ye, Xiaofei Xu, Shengchun Deng. Informationtheoretic Agglomerative K-means[J]. Information Technology Journal. 2011, 10(12):2420-2420.
    [48] M. Zarei, K. A. Samani. Eigenvectors of network complement reveal communitystructure more accurately[J]. Physica A. 2009.388.1721-1730.
    [49] Qingfeng Li, Wenfeng Wen. A new clustering algorithm for large datasets[J].J.Cent.South Univ. Technol. 2011, 18:823-829.
    [50] B. J. Frey, D. Dueck. Clustering by passing messages between data points[J]. Science.2007.315.972-976.
    [51] H. Tong, C. Faloutsos, J.-Y. Pan. Fast random walk with restart and its applications[J].Data Mining. 2006,12: 613-622.
    [52] Prachi M, JoshiParag, A. Kulkarni. A Novel Approach for Clustering based on PatternAnalysis [J]. International Journal of Computer Applications. 2011, 25(4):321-334.
    [53] Ku. KamilllaSusant, K. Mohapatra. A New Approach on K-Means Clustering[J].International Journal of Computer Science and Information Security. 2012, 9(11):142-152.

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700