针对包含异常值数据的优化K-MEANS聚类算法

英文题名：Optimization of K-MEANS Clustering Algorithm for Data with Outliers
作者：焦昂
论文级别：硕士
学科专业名称：计算机应用技术
中文关键词：K-MEANS聚类算法 ; 异常值 ; 基于聚类的异常值检测 ; 熵
英文关键词：K-MEANS clustering algorithm ; Outlier ; Cluster-Based Outlier Detection ; Entropy
学位年度：2009
导师：张军英
学科代码：081203
学位授予单位：西安电子科技大学
论文提交日期：2009-05-01

摘要

K-MEANS聚类算法是一种广为应用的简捷的迭代算法,其应用价值和重要性受到很多领域的认可。传统的K-MEANS算法以数据点之间的欧氏距离为测度,误差平方和为目标函数。K-MEANS算法对异常值的存在很敏感,因为“均值”本身就不是一个健壮的统计量。而异常值是这样一种极端的观测结果,它们在数值上远离样本数据集的均值,其存在使得所有基于均值和方差的统计测试一定程度地失真。然而,大样本中总会有一定量的异常值。因此K-MEANS聚类算法的效果不可避免地受到异常值的影响。
本文就K-MEANS聚类算法的原理进行了研究,并提出了一个基于异常值删除的K-MEANS优化算法。该算法主要的特点就是利用了K-MEANS聚类算法原理上的缺陷,即会陷入局部极小的特点,在基于聚类的异常值检测的思想下,以聚类的方式寻找异常值并将其删除。算法引入了熵和平衡的概念,作为算法终止的一种条件。为了防止K-MEANS算法陷入某个局部极小而应用了一种类似刺激的机制,即利用类似欠阻尼曲线的变化形式来控制聚类数目的改变,以使K-MEANS算法在陷入某个局部极小而无法继续寻找异常值的时候能够跳出该局部极小,在不断的聚类过程中,能够继续寻找并删除异常值聚类,从而减小K-MEANS聚类算法受异常值的影响,有效地提高了算法寻找聚类中心的能力和聚类的准确率。
The K-MEANS clustering algorithm is a widely used simple iterative method to partition a given dataset into a user-specified number of clusters, k. Its practical value and importance have been acknowledged across different disciplines. The traditional K-MEANS take Euclidean distances of each observation as the measurement, and error square sum the objective function. An outlier is such an extreme observation that is numerically distant from the mean of data sample, which causes all the statistical tests based on mean and variance to distort to some extent. However, a small number of outliers not due to any anomalous condition are to be expected in large samples. Thus K-MEANS inevitably is impacted by the existence of outliers.
This paper researches on the algorithm and measurement of K-MEANS, then proposes an optimization of K-MEANS algorithm based on outlier deletion. The main point is that the defect that K-MEANS would fall into a suboptimization is made advantage in our algorithm. Under the strategy of cluster-based outlier detection, outliers are searched and deleted in clusters. The notion of entropy and balance are invited as a condition to end the clustering process. To avoid K-MEANS from falling into certain suboptimization and ceasing searching for outliers, a mechanism of stimulation is introduced. The number of clusters, k, is changing during the deletion process, following a certain curve. The aim of changing k is to kick iterative process out of the suboptimization which as a result would help continue the outlier deletion process. Thus outliers would be searched and deleted as much as possible. The ability to find cluster centers and cluster correctly is raised effectively after decreasing the influence outliers have on K-MEANS algorithm.

引文

[1].(美)R.格罗思著,侯迪等译."Data Ming: building competitive advantage [M]."西安：西安交通大学出版社,2001.pp.54-60.
    [2]. Jiawei Han, Micheline Kamber著,范明,孟晓峰等译.“数据挖掘概念与技术.”北京：机械工业出版社,2001.pp.275-280.
    [3].谭勇,荣秋生.“一个基于K-MEANS的聚类算法的实现.”湖北民族学院学报(自然科学版).2004,3,22(1).pp.69-71.
    [4].石云平,辛大欣.“基于K-MEANS聚类算法的分析及应用.”西安工业学院学报.2006,2,26,(1).pp.45-48.
    [5]. XindongWu, Vipin Kumar, J. Ross Quinlan, Joydeep Ghosh, Qiang Yang, Hiroshi Motoda, Geoffrey J. McLachlan, Angus Ng, Bing Liu, Philip S. Yu, Zhi-Hua Zhou, Michael Steinbach, David J. Hand, Dan Steinberg. "Top 10 algorithms in data mining." Knowl Inf Syst (2008) 14:1-37. DOI 10.1007/s10115-007-0114-2.2008. pp.1-10.
    [6]. Robert D. Pierce. "Application of the Positive Alpha-Stable Distribution." Proceedings of the IEEE Signal Processing Workshop on21-23.1997,7,245(15). pp.420-424.
    [7]. X.Huang, A.C. Madoc, and M. Wagner. "NOISES REMOVAL FOR IMAGES BY WAVELET-BASED BAYESIAN ESTIMATOR VIA LEVY PROCESS ANALYSIS." IEEE International Conference on Multimedia and Expo (ICME).2004. pp.327-330
    [8]. Panayiotis G. Georgiou, Chris Kyriakakis. "Robust maximum likelihood source localization the case for subGaussian vesus Gaussian." IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING.2006,4,14(4). pp.1470-1480.
    [9]. C. Tzagkarakis, A. Mouchtaris, and P. Tsakalides. "Musical Genre Classification via generalized gaussian and ALPHA-STABLE modeling." IEEE ICASSP.2006. pp.217-220.
    [10]. X. Huang, A. C. Madoc and A.D.Cheethaml. "Multi-noise Removal from Images by Wavelet-based Bayesian Estimator." Proceedings of the IEEE Sixth International Symposium on Multimedia Software Engineering (ISMSE'04).2004,12. pp.258-264.
    [11]. Lyudmila Mihaylova, Paul Brasnett, Alin Achim. "Particle Filtering with ALPHA-STABLE Distributions." IEEE/SP 13th Workshop on 17-20.2005,7. pp.381-386.
    [12]. Ercan E. Kuruoglu and Josiane Zerubia. "Modelling SAR Images with a Generalisation of the Rayleigh Distribution." Conference Record of the Thirty-Fourth Asilomar Conference.2000, 11,1. pp.224-228.
    [13]. D. Salas-Gonz'alez, E. E. Kuruoglu, D. P. Ruiz. "ESTIMATION OF MIXTURES OF SYMMETRIC ALPHA STABLE DISTRIBUTIONS WITH AN UNKNOWN NUMBER OF COMPONENTS." ICASSP.2006. pp.545-548.
    [14]. Ge Xiaohu, Shaokai Yu, Won-Sik Yoon and Yong-Deak Kim. "A New Prediction Method of Alpha-Stable Processes for Self-Similar Traffic." IEEE Communications Society Globecom. 2004. pp.675-679.
    [15]. SAMORODNITSKY G, TAQQU M S. "Stable non-Gaussian random processes [M]." New York: Chapman & Hall,1994. pp.288-317.
    [16]. Moore, D. S. and McCabe, G. P. "Introduction to the Practice of Statistics.3rd." New York: W. H. Freeman,1999. pp.1-500.
    [17]. An American National Standard. "Standard Practice for Dealing With Outlying Observations." Designation: E 178-02. May 10,2002. pp.1-2.
    [18]. D. Hawkins. "Identification of Outliers." Chapman and Hall, London,1980.1-500
    [19]. Lian Duan, Lida Xu, Ying Liu and Jun Lee. "Cluster-based outlier detection." Ann Oper Res (2009) 168. pp.151-168
    [20]. Shao M, Nikias C L. "Signal processing with fractional lower order moment stable processes and their applications [J]." Proc. of the IEEE.1993,81(7). pp.986-1010.
    [21]. Nikias C L, Shao M. "Signal processing with alpha-stable distribution and application [M]." New York: John Wiley & Sons, Inc,1995. pp.13-45.
    [22]. Bernd Girod, Rudolf Rabenstein, Alexander Stenger. "Signals and Systems [M]."影印版,北京：清华大学出版社.2003.pp.17-59.
    [23].吴英,李传文,沈红梅.“牛顿二项式定理的证明及其应用.”中国科技论文在线.http://www.paper.edu.cn,中图分类号：0174.14.
    [24].倪振华.“振动力学.”西安：西安交通大学出版社.1990.pp.61-69.
    [25].R.W.克拉夫,J.彭津著,王光远等译.“结构动力学.”北京：科学出版社.1981.18-21.
    [26]. O. L. Mangasarian and W. H. Wolberg. "Cancer diagnosis via linear programming." SIAM News, vol.23, no.5, Sep.1990. pp.1-18.
    [27]. S. Hawkins, H. He, G. Williams and R. Baxter. "Outlier detection using replicator neural networks." Lecture Notes in Computer Science, vol.2454,2002. pp.170-180.
    [28]. G.J. Gordon, R.V. Jensen, L.L. Hsiao. "Translation of Microarray Data into Clinically Relevant Cancer Diagnostic Tests Using Gene Expression Ratios in Lung Cancer and Mesothelioma." Cancer Research, Vol.62,2002. pp.4963-4967.
    [29]. I. Guyon, J. Weston, S. Barnhill, V. Vapnik, "Gene selection for cancer classification using support vector machines," Mach Learn, Vol.46,2002. pp.389-422.

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700