一种并行分层聚类算法的研究和实现

英文题名：The Research and Realizing of Parallel Cluster Analysis Algorithm
作者：李朝健
论文级别：硕士
学科专业名称：计算机技术
中文关键词：层次聚类 ; 并行算法 ; 存储冲突 ; EREW-SIMD ; 高性能计算
英文关键词：Hierarchical clustering ; Memory conflicts ; Parallel algorithms ; EREW-SIMD ; High performance computing
学位年度：2007
导师：肖建华
学科代码：081203
学位授予单位：湘潭大学
论文提交日期：2007-11-01

摘要

聚类分析是数据挖掘的重要研究领域之一,在工程、商业、生命科学、社会科学以及其他许多领域得到了广泛的应用。但由于聚类对象在高维特征空间分布的复杂性,聚类效果评价的不确定性和灵活性,以及聚类作为一个优化问题求解的高计算复杂性,聚类算法仍然面临着众多的问题和挑战。
     目前,研究者提出了大量的聚类算法。其中层次聚类算法是其中的主要方法之一,受到了大量学者的密切关注。目前最好的串行算法的时间复杂性可达到O(n2),但依然难于处理生物信息学或入侵检测中的海量数据;并行算法目前多基于CREW-PRAM或CRCW-PRAM模型,其运行成本不低于O(n2)。这些算法多使用随机或概率算法,而且算法中的处理器数目无法根据运行环境改变,也没有考虑各并行处理器对共享存储器的存储冲突。本文通过利用完全图求欧几里德最小生成树算法和无存储冲突的连通分支确定算法,提出一种基于EREW-SIMD共享存储模型的无存储冲突并行层次聚类算法,其成本为O(n2)。通过与其他算法性能比较,比较结果说明本文提出的算法在保证存储无冲突的前提下,能以最优的成本在最弱的PRAM—EREW模型运行,且处理器可根据实际可用的计算资源动态调整。
     为了验证本文算法的性能,利用基准测试数据集在高性能计算中心的IBM P690机器上进行了计算实验。实验结果表明:本文算法在计算时间和空间上具有一定的比较优势,对大规模数据集具有较强的可扩展性。
Clustering analysis is one of the important areas in the data mining research. Especially, it is applicable in many fields, such as engineering, business, biology, social sciences and others. However, because of the complexity of the distribution of clustering object at high dimensional feature space, the uncertainty and the flexibility of the evaluation of clustering results and high computing complexity of cluster problem as an optimized problem, clustering algorithm still are confronted with lots of problems and challenges.
     At present, researchers put forward lots of clustering algorithms. Therein, hierarchical clustering algorithms as a main kind, are paid much attention. So far time complexity of the best sequential algorithms is up to O(n2), but huge data in biological information or intrusion detection is hard for them to process; most of the parallel algorithms use the CRCW-PRAM or CREW-PRAM models of computing, the cost of these algorithms are O(n2) or larger, most are randomized or probability algorithms, unable to automatically adapt to the number of processors, and not take account of sharing memory conflicts. This paper proposed a new parallel adaptive deterministic algorithm without memory conflicts based on EREW-SIMD shared memory model by use of complete graph and Euclidean minimum spanning tree, with cost of O(n2) . Through performance comparison with other algorithms, it is demonstrated that on guarantee of no memory conflicts, the algorithm can run on the weakest PRAM-EREW model at an optimal cost, and automatically adapt to the change of the number of processors.
     To validate the performance and extensibility of our algorithm in practice, we made several experiments and simulations based on the synthetic and benchmark data sets from Internet on IBM P690 of high performance computing centre. The results showed that our algorithm has a great superiority in both computing time and space, and consequently a stronger adaptability and operability for large scale data sets.

引文

[1] Huang, Z., Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Mining Knowledge Discovery 2,283-304. 1998.

    [2] J€ager, J., Kriegel, H.P., A fast parallel clustering algorithm for large spatial databases. Data Mining Knowledge Discovery 3 (3), 263-290. 1999.

    [3] Lance, G.N., Williams, W.T., A general theory of classificatory sorting strategies: hierarchical ystems. Comput. J. 9, 373-380.1967.

    [4] Li, X., Fang, Z., Parallel clustering algorithms. Parallel Comput. 11, 275-290.1989.

    [5] C.F. Olson, "Parallel Algorithms for Hierarchical Clustering,"Parallel Computing, vol. 21, pp. 1313-1325,1995

    [6] Sanguthevar Rajasekaran, Efficient Parallel Hierarchical Clustering Algorithms. IEEE transactions on parallel and distributed systems, vol. 16, no. 6, june 2005:497-502.

    [7] X. Li and Z. Fang, "Parallel Clustering Algorithms," Parallel Computing, vol. 11, pp. 275-290, 1989.

    [8] X. Li, "Parallel Algorithms for Hierarchical Clustering and Clustering Validity," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 12, pp. 1088-1092,1990.

    [9] Olson, C.F., Parallel algorithms for hierarchical clustering. Parallel Comput. 21 (8), 1313-1325. 1995.

    [10] Rasmussen, E.M., Willett, P., Efficiency of hierarchical agglomerative clustering using ICL distributed array processor. J. Doc. 45 (1), 1-24. 1989.

    [11] De Hoon, M.J.L., Imoto, S., Nolan, J., Miyano, S., Open source clustering software. Bioinformatics 20,1453-1454, 2004.

    [12] Kaufman, L., Rousseauw, P.J., Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley and Sons, New York. 1990

    [13] Ng, R., Han, J., Efficient and effective clustering methods for spatial data mining. In: proceedings of VLDB-94. 1994.

    [14] Ni, L., Jain, A., A VLSI systolic architecture for pattern clustering. IEEE Trans. Pattern Anal. Machine Intell. 7 (1), 80-89. 1985.

    [15] Alsabti, K., Ranka, S., Singh, V., An efficient k-means clustering algorithm. In: Proc. of the First Workshop on High Performance Data Mining. 1995.

    [16] Blake, C.L., Merz, C.J.,. UCI Repository of machine learning databases. University of California, Department of Information and Computer Science, Irvine, CA. [http://www.ics.uci.edu/mlearn/MLRepository.html]. 1998.

    [17] Bradley, P.S., Fayyad, U.M., Refining initial points for kmeans clustering. In: Proc. of the UCAI-93, San Mateo,CA, pp. 1058-1063.1983.

    [18] Bradley, P.S., Fayyad, U.M., Reina, C., Scaling clustering algorithms to large databases. In: Proc. of the 4th Internat. Conf. on Knowledge Discovery and Data Mining,pp. 9-15.1998.

    [19] Sibson, R., SLINK: an optimally efficient algorithm for the single link cluster methods. Comput. J.16,30-34.1973.
    [20]Anderberg,M.,Cluster Analysis for Applications.Academic Press,New York.1973.
    [21]Jain,A.K.,Dubes,R.C.,Algorithms for Clustering Data.Prentice-Hall,Englewoods 1988.
    [22]Cliffs,NJ.Johnson,S.,Hierarchical clustering schemes.Phychometrika 23,241-254.1967
    [23]Zhang,T.,Ramakrishnan,R.,Livny,M.,BIRCH:an efficient data clustering method for very large databases.In:SIGMOD'96,Montreal,Canada,pp.103-114.1996.
    [24]G.Karypis,E-H.Han,V.Kumar,CHAMELEON:a hierarchical clustering algorithm using dynamic modeling,IEEE Computer 32(1999) 68-75.
    [25]Prim,R.C.,Shortest connection networks and some generalizations.Bell Syst.Tech.J.36,389-1401.1957.
    [26]Rasmussen,E.M.,Willett,P.,Efficiency of hierarchical agglomerative clustering using the ICL distributed array processors.J.Document.45(1),1-24.1989.
    [27]Piftzner,D.W.,Salmon,J.K.,Sterling,T.,Halo world:tools for parallel cluster finding in astrophysical N-body simulations.Data Mining Knowledge Discovery 1(4),419-438.1997.
    [28]Provost,E,Kolluri,V.,Scaling up inductive algorithms:an overview.In:Proc.of the 3rd Internat.Conf.on Knowledge Discovery and Data Mining,pp.239-242.1997.
    [29]陈国良.并行算法的设计与分析.北京:高等教育出版社,2002.
    [30]Li,K L,Li Q.H,Dai G.M.An adaptive algorithm for the knapsack problem.Journal of Computer Development and Research,12(7):1024-1029,2004.
    [31]S.Rajasekaran,"On the Euclidean Minimum Spanning Tree Problem," technical report BECAT/CSE,Univ.of Connecticut,2004.
    [32]M a Jun and M a Shaohan.Effcient Parallel Algorithm s for Some Graph Theory Problems Vol.8No.4 J.of Comput.Sci.& Technol 1993.
    [33]Li,K L,Li Q.H,Li,R.E Optimal parallel algorithm for the knapsack problem without memory conflicts.Journal of Computer Science and Technology.2004,19(6):760-768
    [34]Li,K L,Li,R.F,Li Q.H,A Parallel Three-List Algorithm for the Knapsack Problem without Memory Conflicts,Chinese journal of computers.2006,29(2):345-352
    李肯立、李仁发李庆华等,背包问题无存储冲突的并行三表算法,计算机学报,2006,29(2):345-352
    [35]H.R.Tsai,S.J.Horng,S.S.Lee,S.S.Tsai,and T.W.Kao,"Parallel Hierarchical Clustering Algorithms on Processor Arrays with a Reconfigurable Bus System," Pattern Recognition,vol.30,pp.801.-815,1997.
    [36]H.Gazit,"An Optimal Randomized Parallel Algorithm for Finding Connected Components in a Graph," SIAM J.Computing,vol.20,no.6,pp.1046-1067,1991.
    [37]C.-H.Wu,S.-J.Horng,and H.--R.Tsai,"Efficient Parallel Algorithms for Heirarchical Clustering on Arrays with Reconfigurable Optical Buses," J.Parallel and Distributed Computing,vol.60,pp.1137-1153,2000.
    [38]Dubes,R.C.,Jain,A.K.,Clustering methodologies in exploratory data analysis.Adv.comput.19,113-228.1980.
    [39]Ester,M.,Kriegel,H.P.,Sander,J.,Xu,X.,A densitybased algorithm for discovering clusters in large spatial databases with noise.In:Proc.of the 2nd Internat.Conf.on Knowledge Discovery and Data Mining,pp.226-231.1996.
    [40]Vrahatis,M.N.,Boutsinas,B.,Alevizos,P.,Pavlides,G.,.The new k-windows algorithm for improving the k-means clustering algorithm.J.Complexity.2002.
    [41]B.Chazelle,"A Minimum Spanning Tree Algorithm with Inverse-Ackerman Type Complexity," J.ACM,vol.47,no.6,pp.1028-1047,2000.
    [42]D.R.Karger,P.N.Klein,and R.E.Tarjan,"A Randomized Linear-Time Algorithm to Find Minimum Spanning Trees," J.ACM,vol.42,no.2,pp.321-328,1995.
    [43]Ka wong chong,Yijie han,Yoshihide,T.W.lam,"Improve then efficiency of parallel minimum spanning tree algorithms",Discrete Aplplied Mathematics 126(2003)33-54.
    [44]Driscoll,J.R.,Gabow,H.N.,Shrairman,R.,Tarjan,R.E.,Relaxed heaps:an alternative to Fibonacci heaps with applications to parallel computation.Commun.ACM 31(11),1343-1354,1988.
    [45]Olson,C.E,Parallel algorithms for hierarchical clustering.Parallel Comput.21,1313-1325.1995.
    [46]MacQueen,J.B.,Some methods for classification and analysis of multivariate observations.In:Proc.of the 5th Berkeley Symposium on Mathematics Statistics and Probability,pp.281-297.1967.
    [47]罗会兰,孔繁胜,杨小兵,刘必红,基于数学形态学的聚类分析,模式识别与人工智能,vol.19,727-733,2006.
    [48]王凯,贺国平,候伟真,改进的模糊核C-均值算法,微电子学与计算机,vol.23,141-143,146,2006。
    [49]高尚,汤可宗,杨静宇,一种新的基于混合蚁群算法的聚类方法,vol.23,38-40,2006。
    [50]周晓云孙志挥张柏礼杨宜东,高维数据流聚类及其演化分析研究,计算机研究与发展,vol.43,2005-2011,2006.
    [51]赵东东宗瑜江贺张宪超,一种多空间聚类算法,小型微型计算机系统,vol.27,2297-2300,2006.
    [52]黄添强,秦小麟,王金栋,多代表点特征树与空间聚类算法,计算机科学,vol.33,189-195,2006.
    [53]刘建骅,李芳,一种基于密度的高性能增量聚类算法,计算机工程,v01.32,76-78,2006.
    [54]刘青宝,候东方,邓苏,张维明,基于相对密度的增量式聚类算法,国防科技大学学报,vol.28,73-79,2006.
    [55]杨莉,高维数据的可视化和快速聚类算法,计算机科学,vol.33,132-133,2006.
    [56]刘峻岭,孙焕良,王大玲,牛志成,一种优化的基于网格的聚类算法,小型微型计算机系统。vol.27,1927-1930,2006.
    [57]陈志平,王雷,李志成,基于密度梯度的聚类算法研究,计算机应用,vol.26,2389-2392,2006.
    [58]胡文瑜,孙志辉,周晓云,基于相异性选择的密度聚类算法研究,小型微型计算机系统,vol.26,1601-1604,2006.
    [59]修宇,王士同,吴锡生,胡德文,方向相似性聚类算法DSCM,计算机研究与发展,vol.43,1425-1431,2006.
    [60]于林森,张田文,基于视觉与标注相关信息的图像聚类算法,电子学报,vol.34,1265-1269,2006.
    [61]任江涛,施萧萧,孙倆吴,黄焕宇,印鉴,一种改进的基于特征赋权的 K-均值聚类算法,计算机科学,vol.33,186-187,2006.
    [62]赵亚琴,周献中,何新,王建宇,一种有效的高属性维稀疏数据聚类算法,模式识别与人工智能,vol.19,289-294,2006.
    [63]张建华,江贺,张宪超,蚁群聚类算法综述,计算机工程与应用,vol.42,171-174,2006.
    [64]陈宗海,文峰,聂建斌,吴晓曙,基于节点生长K-均值聚类算法强化学习方法,计算机研究与发展,vol.43,661-666,2006.
    [65]甄彤,基于层次与划分方法的聚类算法研究,计算机工程与应用,vol.42,178-180,2006.

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700