数据挖掘中的聚类方法及其应用

英文题名：Clustering Methods in Data Mining with Its Applications
副题名：基于统计学视角的研究
英文副题名：A Research from the Perspective of Statistics
作者：殷瑞飞
论文级别：博士
学科专业名称：统计学
中文关键词：数据挖掘 ; 聚类分析 ; 统计方法
英文关键词：Data Mining ; Cluster Analysis ; Statistical Method
学位年度：2008
导师：朱建平
学科代码：020208
学位授予单位：厦门大学
论文提交日期：2008-04-01

摘要

数据挖掘是近几年随着数据库和人工智能发展起来的一门新兴技术,它从大量原始数据中发掘出隐含的、有用的信息和知识,帮助决策者寻找数据间潜在的关联,发现被忽略的因素。数据挖掘因其巨大的商业前景,现已成为国际上数据库和信息决策领域最前沿的研究方向之一,并引起了学术界和工业界的广泛关注。
     面对海量数据,首要的任务就是对其进行归类,聚类分析就是对原始数据进行合理归类的一种方法。作为数据挖掘的一项重要功能,聚类分析能作为一个独立的工具来获得数据的分布情况,观察每个类的特点,集中对特定的某些类做进一步的分析。此外,聚类分析也可以作为其它算法的预处理步骤。因此,聚类分析已经成为数据挖掘领域中一个非常活跃的研究课题。
     数据挖掘的相关文献中已经存在大量的聚类方法。然而,从目前来看,对数据挖掘中聚类方法的研究大都集中于计算机科学领域,更多注重聚类算法的研究,或者对现有聚类方法进行算法上的改进,而很少真正从统计学角度出发对数据挖掘中的聚类问题进行深入分析。本文尝试从统计学视角出发,以统计理论为基础,以统计方法与算法的结合为基本思路,将一些现有的优秀统计方法,如因子分析、对应分析、函数型数据分析等引入数据挖掘领域,使其能够应用于海量数据的聚类分析。
     本文共分为六章,各章的内容安排如下:
     第1章介绍了本文的选题背景、研究内容以及本文的主要创新之处。
     第2章首先简单阐述了数据挖掘的定义、功能和常用技术,然后对当前数据挖掘中主要的聚类方法及其研究进展进行了综述,并从聚类标准、类的标识、聚类算法框架三个角度对各种聚类方法进行了全面而深入的对比与总结。
     第3章通过对经典O型因子模型进行改进,克服了其算法效率上的困难,提出了一种新的海量数据聚类方法——Q型因子聚类法,并将其成功应用于上市公司板块分析,为投资决策提供帮助。
     第4章基于Benzécri对应分析的基本思路,结合Q型因子分析的思想,提出了数据挖掘中的对应分析聚类法。利用对应分析聚类法对移动通讯月度消费大型数据库进行聚类分析,实现了移动通讯消费市场的细分。
     第5章借助函数型数据分析的基本思想和方法,建立了一个时序数据库聚类分析的一般框架,并将这一方法扩展到多变量的情形,解决了多变量时序数据的聚类问题。将该方法应用到投资组合风险管理中,利用聚类结果进行资产选择,有效地降低了组合投资风险。
     第6章对全文的主要工作进行了总结,并指出了本文的不足之处以及进一步研究的方向。
     本文尝试在以下几个方面有所创新:
     1.通过对经典Q型因子模型进行改进,克服了其算法效率上的困难,提出了一种新的海量数据聚类方法——“Q型因子聚类法”。
     2.提出了数据挖掘中的“对应分析聚类法”。该方法既解决了Q型因子分析算法效率方面的问题,也解决了传统对应分析法中缺乏客观分类标准、信息损失严重等多种缺陷。
     3.在对应分析聚类法的提出过程中,构造了对应分析中的标准化因子载荷阵,给出了对应分析中因子得分的求解方法,并首次将因子旋转引入对应分析中,在一定程度上扩展了对应分析的方法和理论体系。
     4.借助函数型数据分析的基本思想和方法,建立了一个时序数据库聚类分析的一般框架,在这个框架之下,大量传统的静态聚类方法都可以被应用到时序数据聚类当中。
Data mining is a new technology, developing with database and artificial intelligence. It is a processing procedure of extracting credible, novel, effective and understandable patterns from databases. Owing to its tremendous business prospects, data mining has been one of the most popular research areas in database and information technology, and has received increasing attentions in the past years.
     Cluster analysis is an important data mining technique used to find data segmentation and pattern information. By clustering the data, people can obtain the data distribution, observe the character of each cluster, and make further study on particular clusters. In addition, cluster analysis usually acts as the preprocessing of other data mining operations. Therefore, cluster analysis has become a very active research topic in data mining.
     As the development of data mining, a number of clustering methods have been founded. The recent studies on clustering methods in data mining come mostly from computer science area, paying more attention to clustering algorithm research. The study of clustering technique from the perspective of statistics, however, is relative scarce. Based on the statistical theories, our paper make effort to combine statistical method with the computer algorithm technique, and introduce the existing excellent statistical methods, including factor analysis, correspondence analysis, and functional data analysis, into data mining.
     This paper consists of six chapters, and the main contents of each chapter are outlined as follows:
     Chapter 1 is the introduction, which briefly introduces the research background and issues, contents and frameworks, as well as the contributions of the paper.
     Chapter 2 firstly carries out a review on data mining, the main clustering methods and their recent advances, then analyze systematically these methods from three different viewpoints: clustering criteria, cluster representation and algorithm framework.
     By improving the algorithm of classical Q-mode factor model, chapter 3 put forward a new clustering method for large-scaled database: Q-Mode Factor Clustering Method. This method is used successfully to the listed company board analysis at the last of this chapter.
     In chapter 4, based on the thoughts of Q-mode factor analysis and correspondence analysis, we propose Correspondence Analysis Clustering Method, a new clustering approach in data mining. After clustering the mobile communication consumption data, we realize the segmentation of mobile communication consumption market.
     In chapter 5, a general framework of time series clustering is established by virtual of the thoughts and techniques of functional data analysis. By extending this method to the multivariable condition, we resolve the problem of multivariable time series clustering. Finally, we apply the proposed method to portfolio risk diversification, and the validity is verified through the bootstrap simulation technique.
     Chapter 6 is the summary of the whole paper, including the research conclusions, limitations, and the directions of future research.
     The main innovations in this paper are as follows:
     1. By mending the classical Q-mode factor model, we put forward Q-Mode Factor Clustering Method, which dramatically reduce the time complexity of the algorithm.
     2. We propose a new clustering approach, Correspondence Analysis Clustering Method. The approach is effective in calculation which is an obstacle in Q-mode factor analysis. Additionally, this approach overcomes the subjectivity of traditional correspondence analysis and avoids the lost of information.
     3. In the procedure of Correspondence Analysis Clustering Method, we construct a standardized factor component matrix, resolve the factor score in correspondence analysis, and for the first time introduce factor rotation into correspondence analysis. All of above work expand to some extent the methodology and theory system of correspondence analysis.
     4. By virtual of the thoughts and techniques of functional data analysis, we establish a general framework of time series clustering, under which lots of traditional static clustering method can be applied to time series data.

引文

[1]Friedman,J.H.Data mining and Statistics:What's the Conclusion?[R].Technical Report,Stanford University.
    [2]Fayyad U.,Stolorz U.Data mining and KDD:Promise and Challenges[J].Future Generation Computer Sytems,1997,13(1):99-115.
    [3]郭萌,王珏.数据挖掘与数据库知识发现:综述[J].模式识别与人工智能,1998,11(3):292-299.
    [4]刘同明等.数据挖掘技术及其应用[M].北京:国防工业出版社,2001.
    [5]Han J.,Kamber M.Data Mining:Concepts and Techniques[M].北京:高等教育出版社,2001.
    [6]朱建平.数据挖掘中的统计方法及实践[M].北京:中国统计出版社,2005.
    [7]陈京民.数据仓库与数据挖掘技术[M].北京:电子工业出版社,2002.
    [8]Cleveland W.Visualizing Data[M].Summit,NJ:Hobart Press,1993.
    [9]Devore J.L.Probability and Statistics for Engineering and Sciences[M].New York:Duxbury Press,1995.
    [10]Agrawal R.,Mannila H.,Srikant R.etc.Fast discovery of association rules,Advances in Knowledge Discovery and Data mining[M].AAAI/MIT Press,1996.
    [11]蔡伟杰,张晓辉.关联规则挖掘综述[J].计算机工程,2001,27(5):31-33.
    [12]刘红岩,陈剑.数据挖掘中的数据分类算法综述[J].清华大学学报:自然科学版,2002,42(6):727-730.
    [13]Xiaohui Liu,Gongxian Cheng,Wu J.X.Analyzing outliers cautiously[J].IEEE Tran.on Knowledge and Data Engineering,2002,14(2):432-437.
    [14]Charu C.,Aggarwal,Philip S.Yu.Outlier detection for high dimensional data[J].ACM SIGMOD Record,2001,30(2):37-46.
    [15]王晓晔.时间序列数据分类方法的研究:[D].天津:天津大学,2003.
    [16]欧阳为民,蔡庆生.数据库中的时态数据发掘研究[J].计算机科学,1998,25(4):60-63.
    [17]Qualian J.R.Induction of decision trees[J].Machine Learning,1986,1:81-106.
    [18]袁增任.人工神经元网络及其应用[M].北京:清华大学出版社,2000.
    [19]李敏强.遗传算法的基本理论与应用[M].北京:科学出版社,2002.
    [20]曾黄麟.粗集理论及其应用[M].重庆:重庆大学出版社,1996.
    [21](美)L.A.扎德,陈国权译.模糊集合、语言变量及模糊逻辑[M].北京:科学出版社,1982.
    [22]Everitt,Brian,etc.Cluster Analysis[M].London:Arnold,2001.
    [23]Jain,A.K.,Murty,M.N.,Flynn,P.J.Data clustering:A Review[J].ACM Computing Surveys,31(3),1999.264-323.
    [24]Jain,A.K.,Dubes R.C.Algorithms for Clustering Data[M].NJ:Prentice-Hall,1988.
    [25]Macqueen J.Some methods for classification and analysis of multivariate observations[C]. Proc.5th Berkeley Symp.Math.Statist,1:281-297,1967.
    [26]Kaufman L.,Rousseeuw P.J.Finding Groups in Data:An Introduction to Cluster Analysis [M].New York;John Wiley & Sons,1990.
    [27]Ng R.,Han J.Efficient and effective clustering method for spatial data mining[C].Proc.1994Int.Conf.Very Large Data Bases(VLDB'94),144-155.
    [28]Huang Z.Extensions to the k-means algorithm for clustering large data sets with categorical values[J].Data Mining and Knowledge Discovery,1998,2(2):283-304.
    [29]Zhang T.,Ramakrishnan R.,Livny M.BIRCH:An efficient data clustering method for very large databases[C].Proc.1996 ACM-SIGMOD Int.Conf.Magement of data(SIGMOD'96),103-114.
    [30]Guha S.,Rastogi R.,Shim K.CURE:an efficient clustering algorithm for large databases[C].In:Haas,L.M.,Tiwary,A.,eds.Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data.Seattle:ACM Press,1998,73-84.
    [31]Guha S.,Rastogi R.,Shim K.ROCK:A robust clustering algorithm for categorical attributes [C].Proc.1999 Int.Conf.Data Engineering(ICDE'99),512-521.
    [32]Karypis G.,Han E.H.,Kumar V.CHAMELEON:A hierarchical clustering algorithm using dynamic modeling[J].COMPUTER,1999,32(8):68-75.
    [33]Ester M.,Kriegel H.P.,Sander J.,Xu X.A density-based algorithm for discovering clusters in large spatial databases[C].Proc.1996 Int.Conf.Knowledge Discovery and Data Mining (KDD'96),226-231.
    [34]Ankerst M.,Breunig M.,Kriegel H.P.,Sander J.OPTICS:Ordering points to identify the clustering structure[C].Proc.1999 ACM-SIGMOD Int.Conf.Management of data (SIGMOD'99),49-60.
    [35]Hinneburg A.,Keim D.A.An efficient approach to clustering in large multimedia databases with noise[C].Proc.1998 Int.Conf.Knowledge Discovery and Data Mining(KDD'98),58-65.
    [36]裴继法,谢维信.聚类的密度函数方法[J].西安电子科技大学学报,1997,24(4):463-467.
    [37]Wang W.,Yang J.,Muntz R.STING:A statistical information grid approach to spatial data mining[C].Proc.1997 Int.Conf.Very Large Data Bases(VLDB'97),186-195.
    [38]Wang W.,Yang J.,Muntz R.STING+:an approach to active spatial data mining[C].15th International Conference on Data Engineering,1999,116-125.
    [39]Sheikholeslami G.,Chatterjee S.,Zhang A.WaveCluster:A multi-resolution clustering approach for very large spatial database[C].Proc.1998 Int.Conf.Very Large Data Bases (VLDB'98),428-439.
    [40]Agrawal R.,Gehrke J.,Gunopulos D.,raghavan P.Automatic subspace clustering of high dimensional data for data mining applications[C].Proc.1998 ACM-SIGMOD Int.Conf.Management of Data(SIGMOD'98),94-105.
    [41]Rumelhart D.E.,Zipser D.Feature discovery by competitive learning[J].Cognitive Science,1985,9(1):75-112.
    [42]Kohonen T.Self-organization and associate memory[M].Berlin:Springer-Verlag,1984.
    [43]Kohonen Teuvo.The self-organizing map[J].Neurocomputing,1998,21(1-3):1-6.
    [44]Fisher D.Improving inherence through conceptual clustering[C].Proc.1987,AAAI Conf.461-465.
    [45]Gennari J.,Langley P.,Fisher D.Models of incremental concept formation[J].Artificial Intelligence,1989,40(1):11-61.
    [46]Cheeseman P.,Stutz,J.Bayesian classification(AutoClass):theory and results.In:Fayyad,U.M.,Piatetsky-Shapiro,G.,Smyth,P.,et al.,eds.Advances in Knowledge Discovery and Data Mining[M].AAAI/MIT Press,1996,153-180.
    [47]Everitt B.S.,Hand D.J.Finite Mixture Distributions[M].Chapman & Hall CRC,London,1981.
    [48]胡玉锁,陈宗海.基于混合遗传算法的聚类分析[J].模式识别与人工智能,2001,14(3):352-355.
    [49]Mali U.,Bandyopadhyay S.Genetic al gorithm-based clustering technique[J].Patten Recognition,2000,33(9):1455-1465.
    [50]Eschrich S.,Jingwei Ke,Hall L.O.etc.Fast accurate fuzzy clustering through data reduction [J].IEEE Transactions on Fuzzy Systems,2003,11(2):262-270.
    [51]Zahid N,Abouelala O,Limouri M,etc.Fuzzy clustering based on K-nearest-neighbours rule [J].Fuzzy Sets and Systems,2001,120(2).
    [52]Strehl A,Ghosh J.Relationship-based clustering and visualization for high-dimensional data mining[J].INFORMS J COMPUT,2003,15(2):208-230.
    [53]Milenova B.L.,Campos M.M.O-Cluster:scalable clustering of large high dimensional data sets[C].IEEE International Conference on Data Mining,2002,290-297.
    [54]Daniel B.A.,Ping Chen Using Self-Similarity to Cluster Large Data Sets[J].Data Mining and Knowledge Discovery,2003,7(2):123-152.
    [55]Wei Chi-Ping,Lee Yen-Hsien,Hsu Che-Ming.Empirical comparison of fast partitioning-based clustering algorithms for large data sets[J].Expert Systems with Applications,2003,24(4):351-363.
    [56]Maharaj E.A.Cluster of Time Series[J].Journal of Classification,2000,17(2):297-314.
    [57]Guedalia I.D.,London M.,Werman M.An on-line agglomerative clustering method for non-stationary data[J].Neural Computation,1999,11(2).
    [58]Jung-Hua Wang,Jen-Da Rau,Wen-Jeng Liu.Two-stage clustering via neural networks[J].IEEE Transactions on Neural Networks,2003,14(3):606-615.
    [59]Veenman C.J.,Reinders M.J.T.,Backer E.A maximum variance cluster algorithm,[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2002,24(9):1273-1280.
    [60]Cattell Raymond B.The Three Basic Factor Analytic Research Designs-Their Interrelations and Derivatives[J].Psychological Bulletin,1952,49:499-520.
    [61]Stephenson,William.Some Observations on Q Technique[J].Psychological Bulletin,1952,49:483-498.
    [62]Richard A.,Johnson,Dean W.,Wichern.Applied Multivariate Statistical Analysis(5th Ed)[M].北京:中国统计出版社,2003.
    [63]Guttman L.The quantification of a class of attributes:A theory and Method of scale construction[C].The Committee on Social Adjustment(ed.),The Prediction of Personal Adjustment.New York:Social Science Research Council,1941.
    [64]Benzecri,J.P.Statistical analysis as a tool to make patterns emerge from data[C].In S.Watanabe(ed.),Methodologies of Pattern Recognition.New York:Academic Press,1969:35-74.
    [65]Bartlett M.S.The Statistical Conception of Mental Factors[J].British Journal of Psychology,1937,28:97-104.
    [66]于秀林,任雪松.多元统计分析[M].北京:中国统计出版社,1999.
    [67]陶凤梅.对应分析的数学模型[D].吉林:吉林大学,2005.
    [68]Ramsay J.O.When the data are functions[J].Psychometrika,1982,47(4):379-396.
    [69]严明义.函数性数据的统计分析:思想、方法和应用[J].统计研究,2007,2:87-94.
    [70]Ramsay J.O.,Silverman B.W.Functional data analysis[M].New York:Springer,Berlin Heidelberg,1997.
    [71]Escabias M.,Aguilera A.M.and Valderrama M.J.Principal component estimation of functional logistic regression:discussion of two different approaches.[J].Journal of Nonparametric Statistics,2004,16:365-384.
    [72]James G.M.Generalized linear models with functional predictors[J].Journal of the Royal Statistical Society,2002,64(3):411-432.
    [73]Gareth M.J.,Catherine A.S.Clustering for sparsely sampled functional data[J].Journal of American Statistical Association,2003,98:397-408.
    [74]Abraham C.,Cornillon P.A.,Matzner-Lober E.,Molinari N.Unsupervised curve clustering using B-splines[J].Scandinavian Journal of Statistics,2003,30:581-595.
    [75]Tarpey T.,Kinateder K.K.J.Clustering functional data[J].Journal of Classification,2003,20:93-114.
    [76]朱建平,陈民恳.面板数据的聚类分析及其应用[J].统计研究,2007,4:11-14.
    [77]Marron J.S.,Tsybakov A.B.Visual Error Criteria for Qualitative Smoothing[J].Journal of American Statistical Association,1995,90:499-507.
    [78]Heckman N.E.,Zamar R.H.Comparing the Shapes of Regression Functions[J].Biometrika,2000,87:135-144.
    [79]Markowitz H.Portfolio Selection[J].Journal of Finance,1952,7(1):77-91.
    [80]Evans J.F.,Arch S.H.Diversification and the reduction of dispersion:An empirical analysis [J].Journal of Finance,1968,23(5):761-767.
    [81]Johnson K.H.,Shannon D.S.A note on diversification and the reduction of dispersion[J].Journal of Financial Economics,1974,4:365-372.
    [82]施东晖.上海股票市场风险性研究[J].经济研究,1996,10:44-48.
    [83]吴世农,韦绍永.上海股市投资组合规模和风险关系的实证研究[J].经济研究,1998,4:21-29.
    [84]杨继平,张力健.沪市股票投资组合规模与风险分散化关系的进一步研究[J].系统工程理论与实践,2005,10:21-28.
    [85]CliffA.D.,Haggett P.,The application of multidimensional scaling methods to epidemiological data[J].Statistical Methods in Medical Research,1995,4:102-123.

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700