The Research of Clustering Based on Gravity and Its Application

Original title (Chinese): 引力聚类及其应用研究
Author: 查丰
Degree level: Master's
Discipline: Computer Application Technology
Keywords: Gravity; Clustering; Hierarchical clustering; Covering algorithm; Customer relationship management (CRM); Customer segmentation
Degree year: 2011
Advisor: 贾瑞玉
Discipline code: 081203
Degree-granting institution: Anhui University (安徽大学)
Thesis submission date: 2011-03-01

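Abstract

Data mining has become a prominent area of applied computing in recent years, and clustering is one of its major research branches. Clustering partitions unlabeled samples according to their similarity so that samples within a cluster are as similar as possible while samples in different clusters are as dissimilar as possible, thereby revealing the characteristics and internal patterns of a data set. However, some data sets have highly complex structure and distribution, and data mining poses many open problems for clustering. There is therefore still considerable room for further research on clustering analysis.
     Hierarchical clustering is a widely used family of clustering methods that builds a hierarchy by decomposing the target data set. Depending on the direction of the decomposition, it is either bottom-up (agglomerative) or top-down (divisive).
     The covering algorithm is a constructive learning algorithm: it finds a set of covers such that samples of the same class fall in the same cover and samples of different classes do not share a cover. The covering clustering algorithm borrows this constructive idea and looks for a set of covers such that distances between samples within a cover are small and distances between samples in different covers are large.
     After the Big Bang, all matter in the universe was in a chaotic, disordered state. Under universal gravitation, matter attracted matter, drew closer, and merged to form galaxies, stars, planets, and other celestial bodies. This process closely resembles data clustering: both start from chaos and, by applying some grouping operation to the individuals in it, arrive at a clearly structured result. Motivated by this similarity, we incorporate universal gravitation into clustering algorithms and change the similarity measure from distance alone to the ratio between distance and cluster size. This thesis studies the hierarchical clustering algorithm (Hierarchical Clustering, HC) and the covering clustering algorithm (Covering Clustering Algorithm, CCA). In both, gravity replaces distance in the similarity computation, yielding Hierarchical Clustering Based on Gravity (HCBG) and Covering Clustering Based on Gravity (CCBG). Experimental results show that using gravity as the similarity measure improves the clustering results to some extent.
     Customer Relationship Management (CRM) tightly integrates best business practices with data mining, data warehousing, one-to-one marketing, sales automation, and other information technologies, giving enterprises a business automation solution for sales, customer service, and decision support. Customer segmentation is an important research topic in CRM: by classifying customers effectively and applying targeted sales strategies, it aims to maximize sales profit. The two most important steps in customer segmentation are data mining and decision support. Data mining uses clustering algorithms to find customers with similar behavior; decision support uses methods such as Bayesian classification and decision trees to predict a customer's behavior from that customer's personal data. This thesis uses the gravity-based hierarchical clustering algorithm in the data mining step and predicts customer behavior with naive Bayes classification.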
Data mining has been an important area of applied computing in recent years, and data clustering is an important branch of it. Clustering separates unclassified samples into groups by similarity, so that similarity within a group is high and similarity between groups is low, thereby revealing internal properties and patterns of the data. However, the structure and distribution of some data sets are highly complex, and data mining raises many problems that clustering has yet to solve. There is therefore still much room for further study of such approaches.
     Hierarchical clustering is a common clustering method that creates a hierarchy by decomposing a given set of data objects. Based on the direction of decomposition, it is divided into two types: bottom-up (agglomerative) and top-down (divisive).
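     The bottom-up variant can be illustrated with a minimal sketch. It assumes Euclidean distance between cluster centroids as the merge criterion and a target cluster count `k` as the stopping rule; both choices, and the function names, are assumptions made for illustration and are not taken from the thesis.

```python
import math

def centroid(cluster):
    """Mean point of a cluster given as a list of coordinate tuples."""
    dims = len(cluster[0])
    return tuple(sum(p[d] for p in cluster) / len(cluster) for d in range(dims))

def agglomerative(points, k):
    """Merge the two clusters with the nearest centroids until k clusters remain."""
    if k < 1:
        raise ValueError("k must be at least 1")
    clusters = [[p] for p in points]  # start with one cluster per sample
    while len(clusters) > k:
        best = None  # (distance, i, j) of the closest pair found so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge j into i
        del clusters[j]
    return clusters

# Hypothetical toy data: two well-separated pairs of points.
result = agglomerative([(0, 0), (0, 1), (10, 10), (10, 11)], 2)
```

     A top-down method would run in the opposite direction, starting from one cluster that holds every sample and splitting it repeatedly.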
     The covering algorithm is a constructive learning algorithm that finds a set of covers such that samples of the same class belong to the same cover and samples of different classes belong to different covers. Drawing on this constructive idea, the covering clustering algorithm tries to find a set of covers such that distances within a cover are small and distances between covers are large.
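     The constructive idea of building covers one at a time can be sketched as follows. This is not the CCA studied in the thesis, whose rules for choosing centers and radii are not given in the abstract. The sketch assumes spherical covers with a fixed, user-supplied `radius` and takes the first uncovered sample as each new center; both are simplifying assumptions.

```python
import math

def greedy_covers(points, radius):
    """Build sphere covers one at a time until every sample is covered."""
    if radius < 0:
        raise ValueError("radius must be non-negative")
    uncovered = list(points)
    covers = []  # each cover is a (center, members) pair
    while uncovered:
        center = uncovered[0]  # assumption: first uncovered sample becomes the center
        members = [p for p in uncovered if math.dist(center, p) <= radius]
        covers.append((center, members))
        # Samples inside the new cover are removed from further consideration.
        uncovered = [p for p in uncovered if math.dist(center, p) > radius]
    return covers

# Hypothetical toy data: two groups that are far apart relative to the radius.
covers = greedy_covers([(0, 0), (0, 1), (10, 10), (10, 11)], radius=2)
```

     Each sample ends up in exactly one cover, so the covers can be read directly as clusters.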
     After the Big Bang, all matter in the universe was in a chaotic state. Under gravity, matter attracted matter and then fused to form galaxies, stars, planets, and other celestial bodies. This process is very similar to clustering, in which some grouping computation turns chaotic data into a clearly structured result. Because of this similarity, we bring gravity into the clustering algorithm and improve the similarity measure, moving from plain distance to a measure that also takes cluster size as a parameter. This thesis studies the Hierarchical Clustering algorithm (HC) and the Covering Clustering Algorithm (CCA); using gravity instead of distance as the similarity in both, it proposes Hierarchical Clustering Based on Gravity (HCBG) and Covering Clustering Based on Gravity (CCBG). The results show that using gravity as the similarity can improve clustering quality.
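     The abstract does not state the gravity formula itself. The sketch below assumes a Newtonian form in which the number of samples in a cluster plays the role of mass and the distance between centroids plays the role of separation, so the attraction is size(a) * size(b) / distance^2. The formula, the `eps` guard against division by zero, and the function names are all assumptions for illustration.

```python
import math

def centroid(cluster):
    """Mean point of a cluster given as a list of coordinate tuples."""
    dims = len(cluster[0])
    return tuple(sum(p[d] for p in cluster) / len(cluster) for d in range(dims))

def gravity(a, b, eps=1e-12):
    """Assumed gravity-like attraction: len(a) * len(b) / squared centroid distance."""
    r2 = math.dist(centroid(a), centroid(b)) ** 2
    return len(a) * len(b) / (r2 + eps)

def most_attracted_pair(clusters):
    """Indices (i, j) of the two clusters with the largest mutual attraction."""
    pairs = ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters)))
    return max(pairs, key=lambda ij: gravity(clusters[ij[0]], clusters[ij[1]]))

# Hypothetical toy data: a lone sample sits between a nearer single sample
# (distance 3) and a farther four-sample cluster (centroid distance 4).
lone = [(0.0, 0.0)]
small = [(3.0, 0.0)]
big = [(-4.0, 0.5), (-4.0, -0.5), (-4.5, 0.0), (-3.5, 0.0)]
pair = most_attracted_pair([lone, small, big])  # (0, 2): lone joins the big cluster
```

     Under plain centroid distance the lone sample would merge with the nearer single sample. Under the assumed gravity measure the larger cluster wins despite being farther away, which shows how cluster size enters the similarity.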
     Customer Relationship Management (CRM) is a management philosophy as well as a class of management software and technology. CRM combines best business practices with data mining, data warehousing, one-to-one marketing, sales automation, and other information technologies, and provides a business automation solution that helps companies sell products and helps managers make decisions. Customer segmentation is an important research direction in CRM: through effective classification of customers and targeted marketing strategies, it aims to maximize sales profit. The two most important steps in customer segmentation are data mining and decision support. Data mining tries to find groups of customers with similar behavior; decision support uses Bayesian classification, decision trees, and other methods to predict a customer's behavior from that customer's personal data. This thesis uses HCBG, proposed in the third chapter, for data mining and predicts customer behavior with Bayesian classification.
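     The decision support step can be illustrated with naive Bayes, a simple form of Bayesian classification. The sketch below assumes categorical customer attributes and Laplace (add-one) smoothing. The attribute values and segment labels are hypothetical; in the workflow described above, the labels would come from the clustering step.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class attribute value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (class, attribute index) -> value counts
    values = defaultdict(set)            # attribute index -> distinct values seen
    for row, label in zip(rows, labels):
        for i, v in enumerate(row):
            value_counts[(label, i)][v] += 1
            values[i].add(v)
    return class_counts, value_counts, values

def predict(model, row):
    """Return the class with the highest log-posterior under add-one smoothing."""
    class_counts, value_counts, values = model
    total = sum(class_counts.values())

    def score(c):
        s = math.log(class_counts[c] / total)  # log prior
        for i, v in enumerate(row):
            seen = value_counts[(c, i)][v]      # 0 if the value never occurred with c
            s += math.log((seen + 1) / (class_counts[c] + len(values[i])))
        return s

    return max(class_counts, key=score)

# Hypothetical customers described by (age band, income band); the segment
# labels "A" and "B" stand in for clusters found during data mining.
rows = [("young", "low"), ("young", "low"), ("young", "mid"),
        ("senior", "high"), ("senior", "high"), ("senior", "mid")]
labels = ["A", "A", "A", "B", "B", "B"]
model = train_naive_bayes(rows, labels)
segment = predict(model, ("young", "low"))  # "A"
```

     Working in log space avoids numeric underflow when a customer record has many attributes.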

