Particle Swarm Optimization and Its Application to Clustering High-Dimensional Data
Abstract
Clustering high-dimensional data is an important but difficult task in data mining. A fundamental starting point for data mining is the assumption that a data object, such as a text document, can be represented as a high-dimensional feature vector. The quality of traditional clustering algorithms degrades badly on high-dimensional data because of the curse of dimensionality: as the number of features grows, the data objects become very sparse in the feature space, pairwise distances tend toward equality, and distance measures over the whole feature space lose their meaning. In a high-dimensional data set, a cluster typically involves many irrelevant or redundant features, and different clusters have different subsets of relevant features, so the chance of finding a cluster that exists in all dimensions is close to zero. Techniques for clustering such data sets are called subspace clustering or projected clustering; their goal is to discover clusters in different feature subsets. However, the performance of many subspace or projected clustering algorithms also drops quickly as the size of the subspaces in which the clusters are found grows, and some of them require domain knowledge from the user to tune their parameters, such as the maximum distance between values within a dimension, thresholds for input parameters, and the minimum data density, all of which are difficult to set.
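To make the distance-concentration phenomenon concrete, the following small Python sketch (illustrative only, not part of the thesis; it assumes NumPy is available) measures the relative contrast between the largest and smallest pairwise Euclidean distances among uniform random points as the dimensionality grows:

    import numpy as np

    # As the dimensionality d grows, pairwise distances among random points
    # concentrate: the relative contrast (max - min) / min shrinks, so
    # distance measures over the full feature space lose discriminating power.
    rng = np.random.default_rng(0)
    n = 100
    for d in (2, 10, 100, 1000):
        x = rng.random((n, d))
        diff = x[:, None, :] - x[None, :, :]
        dist = np.sqrt((diff ** 2).sum(-1))[np.triu_indices(n, k=1)]
        print(f"d={d:4d}  (max-min)/min = {(dist.max() - dist.min()) / dist.min():.3f}")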
     To keep particle swarm optimization (PSO) from getting trapped in local optima during global optimization, this thesis first designs a more effective PSO variant. After analyzing the causes of premature convergence in the standard PSO algorithm, we propose InformPSO (Particle Swarm Optimization based on Adaptive Diffusion and Hybrid Mutation). Drawing on the way information diffuses in biological swarms, we design a function of the particle distribution and the iteration count that adaptively adjusts each particle's "social cognition" ability and so increases population diversity. Modeling genetic self-organization and chaotic evolution, we build clonal selection into the algorithm so that the global best particle, gBest, undergoes small deterministic genetic mutations with local amplification, and we use a Logistic sequence to guide the random drift of gBest, further strengthening its ability to escape local optima. Global convergence of the new algorithm is proved from the random state transitions of the population (a Markov chain argument). Simulation results on complex benchmark functions show that, compared with several other PSO variants, the new algorithm converges faster, reaches better optima, is more stable, and suppresses premature convergence more effectively.
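As a minimal sketch of the two mechanisms just described, the Python fragment below is our schematic reconstruction, not the thesis's exact formulas: the coefficient schedule, parameter names, and drift scale are assumptions; only the Logistic map z <- 4z(1-z) is standard. It shows a social-cognition coefficient that adapts with the iteration count and a chaotic Logistic sequence driving the random drift of gBest:

    import numpy as np

    def adaptive_social_weight(t, t_max, c_min=0.5, c_max=2.5):
        # Assumed schedule: a large social term early for diversity, smaller
        # later for exploitation.  The thesis also conditions this adjustment
        # on the particle distribution, which is omitted here.
        return c_max - (c_max - c_min) * t / t_max

    def logistic_drift(g_best, z, scale=0.1):
        # One step of the Logistic map z <- 4 z (1 - z), used as a chaotic
        # signal to perturb the global best position g_best.
        z = 4.0 * z * (1.0 - z)
        return g_best + scale * (2.0 * z - 1.0), z

    def pso_step(x, v, p_best, g_best, t, t_max, w=0.72, c1=1.49):
        # Standard PSO velocity/position update with the adaptive social term.
        c2 = adaptive_social_weight(t, t_max)
        r1, r2 = np.random.rand(*x.shape), np.random.rand(*x.shape)
        v = w * v + c1 * r1 * (p_best - x) + c2 * r2 * (g_best - x)
        return x + v, v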
     Second, dedicated objective functions and encoding schemes tailor the improved PSO to high-dimensional data clustering; it is applied to two problems that arise there. The first problem is the variable weighting problem in soft projected clustering when the number of clusters k is given: find a set of variable weights for each cluster. It is usually formulated as a nonlinear continuous optimization problem subject to many equality constraints. For it we design PSOVW (Particle Swarm Optimization for the Variable Weighting Problem), which searches for the optimal variable weights for soft projected clustering of high-dimensional data. High-quality clustering requires both a suitable objective function and an efficient search strategy. PSOVW uses a special weighted k-means objective function, in which a change of variable weights is reflected exponentially and each cluster's within-cluster variance is computed mainly over its own relevant dimensions rather than over irrelevant ones, together with a non-normalized encoding of the variable weights. The encoding transforms the original equality-constrained search space of the variable weighting problem into a redundant closed space with simple bound constraints, which greatly facilitates the search. Because PSOVW minimizes the objective function with a particle swarm optimizer, it is also much less sensitive to the initial cluster centroids than other soft projected clustering algorithms. Experiments on synthetic data sets and on real data sets from the UCI repository show that PSOVW greatly improves cluster quality.
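The following Python sketch gives one plausible reading of this objective (function and parameter names are ours; beta > 1 is what makes weight changes act exponentially, and the per-cluster normalization inside the objective is what lets particles move in a simple box rather than on a simplex):

    import numpy as np

    def psovw_objective(weights, X, labels, centroids, beta=8.0):
        # weights: (k, d) raw, strictly positive particle coordinates;
        # X: (n, d) data; labels: (n,) cluster assignments; centroids: (k, d).
        w = weights / weights.sum(axis=1, keepdims=True)  # normalize per cluster
        cost = 0.0
        for k in range(centroids.shape[0]):
            Xk = X[labels == k]
            if len(Xk):
                # Weighted within-cluster variances: w**beta concentrates the
                # cost on each cluster's relevant dimensions.
                cost += ((w[k] ** beta) * ((Xk - centroids[k]) ** 2).sum(axis=0)).sum()
        return cost

In a full PSOVW run, such an objective would be evaluated for each particle's weight matrix, with assignments and centroids updated k-means-style between evaluations.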
     The second problem in clustering high-dimensional data is how to determine the number of clusters automatically during clustering; it too can be formulated as a nonlinear continuous optimization problem with bound constraints. For it we design autoPSO (Automatically Determining the Number of Clusters using Particle Swarm Optimization). A special particle encoding lets autoPSO represent, within one population, partitions with different numbers of clusters, and the Davies-Bouldin (DB) cluster validity index serves as the objective function for comparing the quality of different partitions of a data set. We test autoPSO on synthetic high-dimensional data sets and handcrafted low-dimensional data sets and compare it with other clustering techniques; the results indicate that PSO-based automatic clustering of high-dimensional data without a preset number of clusters k is feasible and has broad application prospects.
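For reference, the DB index has a standard definition, sketched below in Python (not the thesis code): each cluster's scatter is the mean distance of its points to their centroid, and the index averages the worst-case ratio of summed scatters to centroid separation, so lower values indicate better partitions for any number of clusters:

    import numpy as np

    def davies_bouldin(X, labels):
        ks = np.unique(labels)
        cents = np.array([X[labels == k].mean(axis=0) for k in ks])
        # S[i]: mean point-to-centroid distance (scatter) of cluster i.
        S = np.array([np.linalg.norm(X[labels == k] - c, axis=1).mean()
                      for k, c in zip(ks, cents)])
        db = 0.0
        for i in range(len(ks)):
            M = np.linalg.norm(cents - cents[i], axis=1)  # centroid separations
            db += max((S[i] + S[j]) / M[j] for j in range(len(ks)) if j != i)
        return db / len(ks)

scikit-learn's sklearn.metrics.davies_bouldin_score provides an equivalent reference implementation against which such a sketch can be checked.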
