基于密度的并行聚类算法研究

英文题名：Studying on Parallel Clustering Algorithms Based on the Density
作者：毛韶阳
论文级别：硕士
学科专业名称：计算机应用技术
中文关键词：并行算法 ; 聚类算法 ; 消息传递 ; 基因序列 ; 密度函数
英文关键词：Parallel Algorithm ; Clustering Algorithm ; Message Passing ; Gene Sequence ; Density Function
学位年度：2007
导师：李肯立
学科代码：081203
学位授予单位：湖南大学
论文提交日期：2007-05-14
答辩委员会主席：文双春

摘要

随着现代生物技术的不断发展特别是基因组计划的实施,人们不断的获得大量基因序列数据,互联网上的基因数据正呈指数增长,这些内涵丰富的数据为人们分析和研究基因的组成与功能之间的关系提供了基础。现代信息技术的发展尤其是超级计算机的飞速发展所带来的高速计算能力正引导着算法研究者们不断研究出新的并行聚类算法,以解决高维海量基因序列数据的计算问题。大量事实说明,一个准确、高效的并行聚类算法对生物计算尤其是基因序列数据计算的影响力是不可估量的。
     本文首先对目前的几种典型的串行聚类算法就适用数据属性范围、时间复杂度等方面进行了分析,提出了对基因序列数据采用基于密度聚类的观点,提出了一种和基因序列数据相匹配的密度函数计算方法及一个相适应的邻域半径计算公式。
     通过对并行计算模型的研究,设计了一种基于密度的并行聚类算法,通过3次时间复杂度为O(n~2/P)的并行运算,能使并行聚类过程的时间复杂度变为O(n/P)。比较传统的基于密度的聚法算法而言,增加了一次计算,以增加一次计算为代价来减少计算机操作上的开销。
     最后在计算机群上对本文所提算法进行了验证,实验结果表明:此算法对高维海量基因序列数据有着很好的聚类效果,簇内数据收敛度高,展示了良好的时间优越性。
With the continuous development of modern biology technology, especially the implement of the Human Genome Project, people have acquired quantities of gene sequence data, the gene's data in the Internet is presenting exponential increase, which supplies basis for people's analyzing and research the relationship of gene's composing and functions. The development of the modern information technical especially the super computer has brought high-speed compute ability, it can guide the researchers to find new clustering algorithm for the high dimension thousands gene sequence data analyses. Lots of experiments show that an accurate and efficient parallel algorithm is impossible to estimate the influence to the biology compute especially to the gene sequence data.
     Firstly, this paper analyzed some typical serial clustering algorithm about data property and the time complexity, put forward a point that the gene sequence data clustering may base on the density, and propose a method of computing about the density function and the neighbor area radius.
     Secondly, this paper studied the parallel algorithm model, and designed a parallel clustering algorithm based on the density, it can make the parallel clustering time complexity into O(n/P) through three time computing with the time complexity of O(n~2/P) . Compare to the traditional clustering algorithm, it added once compute. Take adding once calculation as price to reduce the processing expense.
     Finally, validated this algorithm on computer clusters, the experiment show that the parallel clustering algorithm has efficient cluster ability on the high dimension thousands gene sequence data, has strong convergence in one cluster and displaying good time superiority.

引文

[1] P C 温特, G L 弗莱伽.遗传学[T].谢雍译.北京.科学出版社, 2001,7-10,302-308
    [2] M Beisen, P T Spellman, P O Brown et al. Cluster analysis and display of genome- wide expression patterns[J].Proc Natl Acad Sci, USA,1998; 95: 14863-14868
    [3] P Tamayo et al. Interpreting patterns of gene expression with self organizing maps[J]. In proceedings of National Academy of Science, 1999; 96(6): 2907-2912
    [4] J Herrero et al. A hierarchical unsupervised growing neural network for clustering gene expression patterns[J]. Bioinformatics, 2001; 17(2) :126-136
    [5] Y Yung, W Ruzzo. An empirical study on principal component analysis for clustering gene expression data[J]. Bioinformatics, 2001; 17( 9) :763-774
    [6] C H Q Ding, X He, H Zha etc. Adaptive dimension reduction for clustering high dimensional data[C].In: Proceedings of the 2002 IEEE International Conference on Data Mining, 2002
    [7] H Wang, W Wang, J Yang et al. Clustering by pattern similarity in large data sets [C]. In : Proceedings of ACM SIGMOD International Conference on Management of Data, 2002
    [8] X V Olman, D X. Clustering gene expression data using a graph theoretic approach: application of minimum spanning trees[J]. Bioinformatics,2002; 18( 4) : 536-545
    [9] T Allyn, M Robert, et al. An overview of clustering algorithms[A]. Proceedings of SPIE, The International Society for Optical Engineering [C]. 2001(4367):41-51.
    [10] S Mavroudi, S Papadimitriou, A Bezerianos. Gene expression data analysis with a dynamically extended self - organized map that exploits class information[J]. Bioinformatics, 2002, 18:1446-1453.
    [11] 赵国屏等.生物信息学[M].北京:科学出版社,2002:118-124
    [12] M Schena, D Shalon, R W Davis, P O Brown. Quantitative monitoring of gene expression patterns with a complementary DNA microarray[J]. Science, 1995; 270( 5235) : 467-470
    [13] D J Lockhart, E L Brown,G G Wong,M S Chee,T R Gingeras. Expressionmonitoring by hybridization to high density oligonucleotide arrays[J]. N at Biotech, 1996; 14( 13) : 1675-1680
    [14] 王富刚,陈先农. 基因芯片数据的聚类分析[J]. 国外医学生物医学工程分册, 2004; 27( 2): 98-101
    [15] JiaWei Han, Micheline Kamber. 数据挖掘概念与技术[M]. 范明,孟小峰译. 北京: 机械工业出版社, 2001-8,55-88.
    [16] S Guha, R Rastogi, K Shim. CURE: An efficient clustering algorithm for large databases. In Proceedings of the 1998 ACM SIGMOD international conference on Management of data, Seattle, WA, 1998:73-84.
    [17] G Karypis, E H Han, V Kumar. CHAMELEON: A hierarchical clustering algorithm using dynamic modeling, IEEE Computer, 1999,32: 68-75.
    [18] G J McLachlan, T Krishnan. The EM Algorithm and Extensions. New York, NY: J Wiley, Sons,1997, http://crn/bell-labs.com/cm/ms/departments/sia/project/em/.
    [19] C Wallace, D Dowe. Intrinsic classification by MML-the Snob program. In the Proc. of the 7th Australian Joint Conference on Artificial Intelligence, UNE, Armidale, Australia, World Scientific Publishing Co.,1994:37-44.
    [20] U M Fayyad, G P Shapiro, P Smyth, R Uthurusamy. Advances in Knowledge Discovery and Data Mining, AAAI Press/MIT Press,1996: 95-164
    [21] C Fraley, A Raftery. MCLUST: Software for model-based cluster and discriminant analysis, Tech. Report 342, Dept. Statistics, Univ. of Washington, 1999
    [22] M C Su, C H Chou. A modified version for k-means[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2001, 23 (6 ) : 674 -680.
    [23] P S Bradley, U M Fayyad. Refining initial points for k-means clustering. In Proc. of the 15th ICML, Madison, WI, 1998: 91-99.
    [24] A Likas, N Vlassis, J Verbeek. The global k-means clustering algorithm. http://iris.usc.cduVision-Notes/bibliography/pattern623.html, 2003:451-461.
    [25] G P Babu, M N Murty. A near-optimal initial seed value selection in K-means algorithm using a genetic algorithm. Pattern Recognition, 1993, 14(10):763-769
    [26] D Brown, C Huntley. A practical application of simulated annealing to clustering. Technical Report IPC-TR-91-003, University of Virginia, 1991
    [27] B Zhang. Generalized k-harmonic means-dynamic weighting of data in unsupervised learning. In Proc. of the 1st SIAM International Conference on Data Mining. Chicago, IL, 2001:1-13.
    [28] D Pelleg, A Moore. X-means: Extending K-means with efficient estimation ofthe number of clusters. In Proc. 17th ICMG, Stanford University, 2000: 89-97
    [29] 刘健庄,谢维信等. 聚类分析的遗传算法[J]. 电子学报,1995,23(11):81-83.
    [30] 李碧,雍正正. 一种改进的基于遗传算法的聚类分析方法[J]. 电路与系统学报, 2002,7(3):96-99.
    [31] 刘静 , 钟伟才 , 刘芳 , 焦李成 . 免疫进化聚类算法 [J]. 电子学报,2001,29(12A):1868-1872.
    [32] Chazelle. “A Minimum Spanning Tree Algorithm with Inverse-Ackerman Type Complexity,”[J]. ACM, vol. 47, no. 6, pp. 1028-1047, 2000.
    [33] 钱云涛,谢维信. 一种由模糊逻辑神经元网络实现的聚类分析方法[J]. 西安电子科技大学学报,1995,22(l): 1-7.
    [34] 钱云涛,谢维信. 聚类神经网络的通用设计方法[J]. 西安电子科技大学学报,1997;24(1):15-21.
    [35] 黄敏超,张育林,陈启智. 模糊超球神经网络在模式聚类中的应用[J]. 自动化学报.1997, 23(2): 279-282
    [36] 魏立梅,谢维信. 聚类分析中竞争学习的一种新算法[J]. 电子科学学刊,2000, 22(1): 13-18.
    [37] 黄风岗,宋克欧. 一种集成模糊聚类神经网络[J]. 哈尔滨工程大学学报,1997, 18(3): 82-85
    [38] S Kirkpatrick, C D Gelatt, M P Vecchi. Optimization by simulated annealing[J]. Science, 1983, 220:671-680.
    [39] S L Chiu. Fuzzy Model identification based on cluster estimation. Journal of Intelligent and Fuzzy systems, 1994, 2(3):267-278
    [40] 陈国良. 并行算法的设计与分析[M]. 北京:高等教育出版社, 2002.
    [41] J C Bezdek. Pattern Recognition with Fuzzy Objective Function Algorithms[M]. New York: Plenum Press, 1987
    [42] H Wang, W Wang, J Yang et al. Clustering by pattern similarity in large data sets[C]. In: Proceedings of ACM SIGMOD International Conference on Management of Data, 2002
    [43] H R Tsai, S J Horng, S S Lee, S S Tsai, and T W Kao. “Parallel Hierarchical Clustering Algorithms on Processor Arrays with a Reconfigurable Bus System,” Pattern Recognition, vol. 30, pp. 801-815, 1997.
    [44] C F Olson. Parallel Algorithms for Hierarchical Clustering, Parallel Computing, vol. 21, pp. 1313-1325, 1995
    [45] X Li. Parallel Algorithms for Hierarchical Clustering and Clustering Validity, IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 12, pp. 1088-1092,1990.
    [46] C H Wu, S J Horng, H R Tsai, “Efficient Parallel Algorithms for Hierarchical Clustering on Arrays with Reconfigurable Optical Buses,”[J] Parallel and Distributed Computing, vol. 60, pp. 1137-1153, 2000.
    [47] Sanguthevar Rajasekaran. Efficient Parallel Hierarchical Clustering Algorithms. IEEE transactions on parallel and distributed systems, vol. 16, no.6, June 2005:497-502
    [48] X Li, Z Fang. “Parallel Clustering Algorithms,” Parallel Computing, vol. 11, pp. 275-290, 1989.
    [49] M Schena, D Shalon, R W Davis et al. Quantitative monitoring of gene expression patterns with a complementary DNA micro array[J]. Science, 1995, 270 (5235):467-470.
    [50] L K Li, Q H Li, R F Li. Optimal parallel algorithm for the knapsack problem without memory conflicts. Journal of Computer Science and Technology. 2004,19(6): 760-768
    [51] I Chaudhur, I B Chaudhur. A novel multi-seed nonhierarchical data clustering technique[J]. IEEE Transactions on Systems, Man and Cybernetics: Part B, 1997,27(5): 871-877.

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700