聚类算法及其有效性问题研究

英文题名：Study of Clustering Algorithm and Its Validity Problem
作者：刘明术
论文级别：硕士
学科专业名称：计算数学
中文关键词：分层聚类 ; 马氏距离 ; 聚类有效性 ; 数据挖掘
英文关键词：Hierarchical clustering ; Mahalanobis distance ; Clustering validity ; Data
英文关键词：mining
学位年度：2012
导师：方宏彬
学科代码：070102
学位授予单位：安徽大学
论文提交日期：2012-04-01

摘要

聚类算法是数据挖掘中重要的研究领域,聚类有效性是根据聚类理论方法能判别聚类质量高低的指标.。聚类有效性验证方法主要有基于内部或外部准则的统计假设检验,聚类层次的有效性,单独聚类的有效性,Dunn和类Dunn指标,Davies-Bouldin和类DB指标,Gap统计等。聚类算法常见的有分层聚类算法、网格聚类算法、基于密度聚类算法、基于划分的聚类算法、其它聚类算法等。但这些算法常常采用欧氏距离来度量相似性的,而欧氏距离将样品的不同属性之间的差别等同看待,易受变量之间的相关性干扰,不仅影响聚类的速度和质量,还影响聚类有效性指标的性能,有时不能满足实际要求。另一方面,对两点之间进行距离度量的马氏距离具有很多优点,如它不受量纲的影响,两点之间的马氏距离与原始数据的测量单位无关,由标准化数据和中心化数据计算出的二点之间的马氏距离相同,马氏距离还可以排除变量之间的相关性的干扰。本文探讨了分层聚类算法和欧氏距离的局限性,充分考虑数据的几何结构特征和个体属性,结合马氏距离提出了一种新的属性相似性度量方法及新的聚类有效性函数,并对采用欧氏距离的分层聚类算法进行了改进,实验表明改进算法具有一定的优越性。
Clustering algorithm is an important research field In data mining and clustering validity which is based on clustering theory method is a discrimination index of clustering quality. Clustering validity methods mainly include the statistical hypothesis based on internal or external criteria, effectiveness of clustering level, separate clustering effectiveness, Dunn and class Dunn index., Davies-Bouldin and class DB index as well as Gap statistics, etc. Clustering algorithms are presented familiarly such as hierarchical clustering algorithm, the grid clustering algorithm, clustering algorithm based on density, and clustering algorithm based on the classification, on the other hand Euclidean distance is used to measure the similarity of different samples in these algorithms. Euclidean distance has some manifest disadvantages such as indiscrimination of samples of different attributes and easy interference of correlation between variables, so that it sometimes can not meet the actual demands because of clustering speed and quality as well as clustering validity index of performance. On the other hand, the Mahalanobis distance owns some advantages such as no influence of dimension, namely no relation between the Mahalanobis distance of two samples and the original data measure, the same value about normalized data and the center of the data basing on Mahalanobis distance of two points, and eliminating interference of between variables.
Limitation of the hierarchical clustering algorithm and Euclidean distance is discussed in this paper, moreover a new algorithm of attributes similarity measure and new clustering validity function are presented with respect to Mahalanobis distance in light of the characteristics of data geometrical structure and individual attributes, at the same time hierarchical clustering algorithm basing on Euclidean distance is improved. The improved clustering algorithm is valid by experiments.

引文

[1]陆云.聚类分析数据挖掘方法的研究与应用[D],合肥：安徽大学,2007.4.
    [2]孙吉贵,刘杰,赵连宇.聚类算法研究[J].软件学报,2008,19(1)：9.
    [3]贺玲,吴玲达,蔡益朝.高维空间中数据的聚类算[J],数学的实践与认识,2006,36(9)：189-194.
    [4]曾华,吴耀华,黄顺亮.非均匀类簇密度聚类的多粒度自学习算法[J].系统工程与电子技术.2010年8月第32卷第8期.
    [5]贾晨科,邱保志.基于K-means距离的孤立点和聚类算法研究[D].郑州大学硕士学位文,2006.
    [6]王安志,李明东,李超.各种聚类算法及改进算法的研究[J]电脑知识与技术.2008年第3卷第7期,1539-1540.
    [7]李明华,刘全,刘忠.数据挖掘中聚类算法的新发展[J]计算机应用研究,2008,25(1),60-62.
    [8]Wu B, Shi Z. A. clustering algorithm based on swarm intelligence[C].In:Proceedings IEEE international conferences on info-tech & info-net proceeding. Beijing.2001.
    [9]孙吉贵,刘杰,赵连宇.聚类算法研究[J].软件学报,2008,19(1)：9.
    [10]Breuning M.M,Kriegel H.-P,Ng R.T,and Sander J."LOF:IdentifYing densit Y-based localo-utliers"[C].In proceeding of ACM SIGMOD Inter-national Conference on Management of Data, Dalles,Texas,USA,2000.
    [11]VogeL M A, Wong A C,PFS clustering Method[J].IEEETrans on PAMI,1979,1(3):237-245.
    [12]留待任,孙焕良,牛志成,朱叶丽.一种新的基于密度的聚类与孤立点检测算法[J].沈阳建筑大学学报,vo1.22,No.1(2006).
    [13]方杰,朱京红.日志挖掘中的数据预处理[J].计算机技术与发展,2012,(04).
    [14]L.I.Ezeife,Yilu,Ming Web Log sequential Patterns With Position.Codedpre-Order Linked WAP-Tree[J]. Data.Mining and knowledge Disscorvery,2005,10(1).
    [15]张昊,王琪洁,朱建军,张晓红.样本数据预处理对基于BP神经网络的GPS高程拟合的影响[J].大地测量与地球动力学,2011,(02).
    [16]赖桃桃,冯少荣.聚类算法中的相似性度量方法研究[J],心智与计算,Vo1.2,No.2(2008),176-181.
    [17]刘韬,蔡淑琴,曹丰文,等.基于距离浓度的K-均值聚类算法[J].华中科技大学学报：自然科学版,2007,35(10)：50-52.
    [18]安静,姜青山.一种判断聚类有效性的新指标[J].计算机研究与发展,2005,42(S)：38-42.
    [19]Bezdek JC, Pattern Recognition.with Fuzzy objective Fuction algorithms[M]. New York: Plenums press,1981.
    [20]Halkidim, Batistakisy, Vazirgiannism on Clustering Validation techniques[J]. Intelligent Information Systems 2001,17(2-3):107-145.
    [21]范九伦.基于包含度的聚类有效性函数[J].西安邮电学院学报,1999,4(1)：84-86.
    [22]范九伦.模糊熵理论[M].西安：西北大学出版社,1999.
    [23]孟令奎,胡春春.基于模糊划分测度的聚类有效性指标[J].计算机工程,2007,(11).
    [24]彭慧伶.基于最佳分类数和粗糙集的故障诊断方法[J].计算机与数字工程,2010,(05).
    [25]Bezdek JC,PalNR. Some new indexes of cluster validity[J].IEEE Tattern. SMC-B,1998,28(3): 301-315.
    [26]王圆妹.一种改进的K-means聚类算法的研究[J].长江大学学报：自科版,2006,3(4)：76-77.
    [27]朱灏东,钟勇,赵向辉.一种优化初始中心点的K-means文本聚类算法[J].郑州大学学报：理学版,2009,41(6)：29-32.
    [28]周爱武,余亚飞K-means聚类算法研究[J].计算机技术与发展,2011,(02).
    [29]汤寒青,王汉军.改进的K-means算法在网络舆情分析中的应用[J].计算机系统与工程.
    [30]王嫻,杨绪兵,周宇,周溜溜.一种基于类中心矫正的的层次聚类算法[J].微电子学与计算机,2011(10).
    [31]周世兵,聚类分析中的最佳聚类数确定方法研究及应用[D].江南大学,2011.
    [32]石剑飞,闫怀志,牛占云.基于凝聚的层次聚类算法的改进[J].北京理工大学学报,2004,(03).

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700