K-均值聚类算法的研究与改进

英文题名：Research and Improvement on K-Means Clustering Algorithm
作者：欧陈委
论文级别：硕士
学科专业名称：计算机应用技术
中文关键词：聚类算法 ; K-均值算法 ; 差分进化算法
英文关键词：Clustering algorithms ; K-means algorithm ; Differential evolution algorithm
学位年度：2011
导师：陈曦
学科代码：081203
学位授予单位：长沙理工大学
论文提交日期：2011-03-01
答辩委员会主席：车生兵

摘要

随着计算机技术的飞快发展,人们每天都会面临诸如文本、图像、音频、视频等各种形式的数据,这些数据的数量是极其庞大的,如何快速有效地从这些海量数据中提炼出其间所隐含的有价值的信息,成为人们十分关注且亟待解决的问题。数据挖掘(Data Mining,DM)由此而诞生。它为人们解决这个问题提供了许多卓有成效的方法和工具。聚类分析就是其中最为重要的方法之一,它是数据挖掘技术的重要组成部分。随着近年来对聚类分析技术的研究逐渐深入,其重要性已越来越得到人们的认可。近年来,无论在理论方面还是在实际应用方面,聚类分析技术的研究都取得了丰硕的成果。目前,聚类分析技术已在机器学习、模式识别、图像处理、文本分类、市场营销及统计科学等领域得到了广泛的应用。
     根据数据类型、聚类目的及应用的不同,目前已有的聚类算法大致可以分为以下几种:划分的算法、层次的算法、基于网格的算法、基于密度的算法以及基于模型的算法。其中,研究最为成熟最为经典的就是基于划分的K-均值聚类算法。本文深入研究和分析了K-均值聚类算法的优缺点,并针对其聚类结果易受初始中心影响的特点,对K-均值聚类算法进行了改进。本文所做的主要工作有:
     1.针对K-均值聚类算法对初始聚类中心存在依赖性的缺陷,本文提出一种新的选取K-均值聚类算法初始聚类中心的方法,实验表明,该方法可有效解决由于初始聚类中心选取的过于邻近而导致聚类结果不稳定的问题,提高了聚类结果的有效性和稳定性。
     2.针对K-均值聚类算法存在对初始中心的选择敏感且易陷入局部最优解的缺点,本文将全局寻优能力强的差分进化算法引入聚类中。本文提出了一种改进的差分进化算法,并将改进的差分进化算法和K-均值聚类算法相结合,较好地解决了K-均值聚类算法初始中心的优化问题,实验表明,该方法有效提高了聚类质量和收敛速度。
With the rapid development of computer technology, people face all kinds of data, such as text data, image data, audio data, video data and so on. The quantity of these kinds of data is very large. How to quickly and effectively gain implicit and valuable information from these mass data has been a problem that has got much attention and should been solved urgently. Data mining (DM) has appeared in this situation. It has provided lots of efficient methods and tools on solving that problem for people. The Clustering analysis is one important method of them. It is an important part of data mining. With the gradually intensive research on clustering analysis these years, its importance has been recognized by people more and more. Clustering analysis technology has gained plentiful and substantial achievements in both theory and practice during recent years. At present, clustering analysis has been widely applied in machine learning, pattern recognition, image processing, text classification, marketing, statistical science and lots of others fields.
     According to the difference of data type, clustering purpose and application, we can divide existing clustering algorithms into partition algorithm, hierarchical algorithm, grid-based algorithm, density-based algorithm and model-based algorithm. One of the most mature and classical clustering algorithms is k-means clustering algorithm. It is a partition algorithm. This paper presents deeply research and analysis on merits and defects of k-means clustering algorithm. This paper has provided a improvement on k-means clustering algorithm according to the feature that the results of k-means clustering algorithm liable to be effected by initial centers. Following are the main works have been done:
     1. According to the defect that K-means clustering algorithm is dependent on the initial clustering centers selection, this paper put forward a new initial clustering centers selection method of k-means algorithm. The experiments showed that this method has effectively solved the problem that the clustering result is always unstable due to the initial clustering centers overly close to each other and has improved effectiveness and stability of the clustering result.
     2. Aiming to the disadvantages of k-means clustering algorithm that it is sensitive to the initial centers selection and easily falls into local optimal solution, differential evolution algorithm whose global optimization ability is strong was introduced into clustering in this paper. This paper put forward an improved differential evolution algorithm and made it combined with k-means clustering algorithm at the same time. This method has solved initial centers optimization problem of k-means clustering algorithm well. The experiments showed that the method has effectively improved clustering quality and convergence speed.

引文

[1]纪希禹,韩秋明,李微,等.数据挖掘技术应用实例.北京:机械工业出版社,2009
    [2] Jiawei Han, Micheline Kamber.数据挖掘概念与技术.范明.北京:机械工业出版社, 2007
    [3] Jain AK, Flynn PJ. Image segmentation using clustering. In: Ahuja N, Bowyer K, eds. Advances in Image Understanding: A Festchrift for Azriel Rosenfeld. Piscataway: IEEE Press, 1996. 65-83
    [4] Cades I, Smyth P, Mannila H. Probabilistic modeling of transactional data with applications to profiling, visualization and prediction, sigmod. In: Proc. of the 7th ACM SIGKDD. San Francisco: ACM Press, 2001. 37-46
    [5] Jain AK, Murty MN, Flynn PJ. Data clustering: A review. ACM Computing Surveys, 1999,31(3):264-323
    [6]张建辉.K-means聚类算法研究及应用:[武汉理工大学硕士学位论文].武汉:武汉理工大学,2007
    [7]吴晓蓉.K-均值聚类算法初始中心选取相关问题的研究:[湖南大学硕士学位论文].长沙:湖南大学, 2008
    [8]冯超.K-means聚类算法的研究:[大连理工大学硕士学位论文].大连:大连理工大学,2007
    [9]曾志雄.一种有效的基于划分和层次的混合聚类算法.计算机应用,2007,27(7):1692-1695
    [10]范光平.一种基于变长编码的遗传K均值算法研究:[浙江大学硕士学位论文].杭州:浙江大学,2007
    [11]孙士保,秦克云.改进的K-平均聚类算法研究.计算机工程,2007,33(13):200-202
    [12]孙可,刘杰,王学颖.K均值聚类算法初始质心选择的改进.沈阳师范大学学报,2009,27(4):448-450
    [13]徐义峰,陆春明,徐云青.一种改进的K-均值聚类算法.计算机应用与软件,2008,25(3):275-277
    [14]汪中,刘贵全,陈恩红.一种优化初始中心点的K-means算法.模式识别与人工智能,2009,22(2):299-304
    [15]刘靖明,韩丽川,侯立文.基于粒子群的K均值聚类算法.系统工程理论与实践,2005,25(6):54-58
    [16] Jain AK,Duin Robert PW,Mao JC.Statistical pattern recognition:A review.IEEE Trans.Actions on Pattern Analysis and Machine Intelligence,2000,22(1):4-37
    [17] Sambasivam S,Theodosopoulos N.Advanced data clustering methods of mining web documents.Issues in Informing Science and Information Technology,2006,8(3):563-579
    [18]李金宗.模式识别导论.北京:高等教育出版社,1994
    [19] T.zhang, R.Ramakrishnan, and M.Livny.BIRCH:An efficient data clustering method for very large databases. ACM SIGMOD Record, 1996, 125(2): 103-114
    [20] S. Guha, R . Rastogi, K .Shim .CURE: An efficient clustering algorithm for large databases. In: Proc of the 1998 ACM SIGMOD international conference on Management of data.New York,1998,73-84
    [21] S. Guha, R . Rastogi , K.Shim.A robust clustering algorithm for categorical attributes.In:Proc of the 15th International Conference on Data Engineering.Washington,DC,1999,512-521
    [22] G. Karypis, E.H. Han, V. Kumar. CHAMELEON: A Hierarchica Clustering Algorithm Using Dynamic Modeling.IEEE Trans on Computer,1999,32(8):68-75
    [23] W. wang, J. Yang. R. R. Muntz.STING:A statistical information grid approach to spatial data mining.In:Proc of the 23rd Conference on VLDB.Athens, Greece, 1997, 186-195
    [24] G. Sheikholeslami, S. Chaterjee, A. zhang. WaveCluster, A Multi—Resolution Clustering Approach for Very Large Spatial Database. In: Proc of the 24th Conf on V LDB, New York, 1998, 428-439
    [25] Rakesh A, Johanners G, Prabhakar R, Automatic subspace clustering of high dimensional data for data mining application. In:Snodgrass RT, Winslett M, eds. Proc of the 1994 ACM-SIGMOD Int.conf.on Management of Data Minneapolis:ACM Press,1994,94-105
    [26] M.Ester,H.P.Kriegel,J.Sander,X.Xu.A density-based algorithm for discovering clusters in large spatial databases.In:Proc of the 2th Int’1 Conf on Knowledge Discovery and Data Mining.Portland,1996,226-231
    [27] M.Ankerst,M.Breuning,H.P.Kriegel,et a1.OPTICS:Ordering Points to Identify Clustering Structure. In: Proc of the ACM SIGMOD Conference on Management of Data. Philadelphia,1999,49-60
    [28] A.Hinneburg,D.Keim.An efficient approach to clustering large multimedia database with noise. In:Proc of the 4th ACM SIGKDD on Knowledge Discovery and Data Mining. Seattle,1998,58-65
    [29]毛国君,段丽娟,王实,等.数据挖掘原理与算法(第二版).北京:清华大学出版社,2007
    [30]但汉辉,张玉芳,张世勇.一种改进的K-均值聚类算法.重庆工商大学学报,2009,26(2):144-147
    [31]向永生,张颖,陈曦,等.基于改进GA的K-均值聚类算法.长沙理工大学学报,2009,6(1):73-76
    [32]宋凌,李枚毅,李孝源.基于粒群优化的K均值算法及其应用.计算机工程,2008,34(16):201-203
    [33]孙秀娟.基于遗传算法的K-means聚类算法分析研究:[山东师范大学硕士学位论文].济南:山东师范大学,2009
    [34]石陆魁,柳冰,沈雪勤.一种基于非线性降维的聚类算法.河北工业大学学报,2005,34(1):112-114
    [35]黄志红.基于层次聚类的K均值算法研究.电脑开发与应用,2009,22(7):1-5
    [36]吴艳文,胡学钢.一种K-means算法的K值优化方案.巢湖学院学报,2007,9(6):21-24
    [37]徐宗本.计算智能——模拟进化计算.北京:高等教育出版社,2004
    [38] Storn R, Price K. Differential Evolution-a Simple and Efficient Adaptive Scheme for Global Optimization over Continuous Space.Technical Report, International Computer Science Institute, Berkley,1995(18):22-25
    [39]刘凤龙,陈曦,曹敦,等.基于差分演化的K-均值聚类算法.计算技术与自动化,2010,29(1):48-50
    [40] Storn.R.Differential evolution design of an IIR-filter.In:IEEE Int Conf on Evolutionary Computation.Nagoya,1996,173-268
    [41] Storn.R.On the usage of differential evolution for function optimization.In:Biennialconference of the North American fuzzy information processing society,1996,519-523
    [42] Price K. Differential Evolution: A Fast and Simple Numerical Optimizer.In: 1996 Biennial Conference of the North American,Fuzzy Information Processing Society.New York,1996,524-527
    [43] Liu J,Lampinen J.A fuzzy adaptive differential evolution algorithm.Soft Computing 2005,9(6):448-462
    [44] Wu Zhi-feng,Huang Hou-kuan,Yang Bei,et a1.A modified differential evolution algorithm with self-adaptive control parameters. Proceedings of 2008 3rd International Conference on Intelligent System and Knowledge Engineering,2008:524-527
    [45]邓泽喜.一种新的差分进化算法.计算机工程与应用,2008,44(24):40-42
    [46]吴亮红,王耀南,袁小芳,等.自适应二次变异差分进化算法.控制与决策,2006,21(8):888-902
    [47]邓泽喜,刘晓冀.差分进化算法的交叉概率因子递增策略研究.计算机工程与应用,2008,44(27):33-36
    [48]毛润宇,王小平,薛小平.一种自适应差分演化算法.计算机应用与软件,2008,25(12):24-26
    [49] Wei Yi-qian,A jun Li.Adaptive differential evolution algorithm for multi-objective optimization problems,Applied Mathematics and Computation,2007
    [50] Wen yin Gong,Zhi hua Cai.A Multi-objective Differential Evolution Algorithm for Constrained Optimization.2008 IEEE Congress on Evolutionary Computation(CEC 2008)
    [51] V.L.Huang,A.K Qin,Member,Senior Member.Self-adaptive Differential Evolution Algorithm for Constrained Real-Parameter Optimization.In: 2006 IEEE Congress on Evolutionary Computation.Sheraton Vancouver Wall Center Hotel,Vancouver,BC,Canada.July 16-21,2006
    [52] Karin Zielinski, Rainer Laur. Constrained Single-Objective Optimization Using Differential Evolution 2006 IEEE Congress on Evolutionary Computation July 16-21,2006
    [53] Swagatam Das,Ajith Abraham.Automatic Clustering Using an Improved Differential Evolution Algorithm.IEEE Systems and Humans, January ,2008,(38):218-237
    [54] Swagatam Das,Amit Konar.Automatic image pixel clustering with an improveddifferential evolution.Applied Soft Computing.2008,9(2009):226-236
    [55]李颖,徐桂之,饶利芸,等.微分进化算法在头部电阻抗成像中的应用.中国生物医学工程学报,2005,24(6):672-675
    [56]李聪明.基于差分算法的K-均值聚类分析.现代计算机,2008,(285):67-69
    [57]姜立强,郭铮,刘光斌.差分进化算法缩放因子取值策略研究.仪器仪表学报,2007,28(8):508-510

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700