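Research on Density-Based Clustering Algorithms and Their Application to Telecom Customer Segmentation

English title: A Clustering Algorithm Based on Density with Its Application in the Customer Cluster in the Field of Telecom
Author: 陈园园 (Chen Yuanyuan)
Degree level: Master's
Discipline: Computer Software and Theory
Chinese keywords: 数据挖掘 ; 客户细分 ; 聚类 ; 密度 ; 交叠分区
English keywords: Data Mining ; Customer Cluster ; Cluster ; Density ; Overlapping Division
Degree year: 2008
Advisor: 陈治平 (Chen Zhiping)
Discipline code: 081202
Degree-granting institution: 湖南大学 (Hunan University)
Thesis submission date: 2008-05-15
Chair of the defense committee: 邝继顺 (Kuang Jishun)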

Abstract

With the rapid development of the telecommunications market, telecom customers are becoming increasingly segmented and diversified, and the competitive focus and growth opportunities of telecom enterprises are shifting toward individual market segments. To maintain market leadership and keep raising customer value, operators must actively segment their customers. How to segment customers effectively with data mining methods is therefore an active research topic with significant practical value.
     This thesis presents a fairly comprehensive comparative study of clustering, one of the basic data mining techniques, and applies an improved clustering algorithm to segment telecom customers, so that groups of customers with similar characteristics can be identified and used as the basis for customer analysis and marketing strategy. The main contributions and features are:
     1) To address the inability of density-based clustering methods to handle data samples with unevenly distributed density, a clustering algorithm based on representative points and point density (CBRD) is proposed. The algorithm takes the average density of a cluster's representative points as the cluster density and the k nearest neighbors of a representative point as its representative region. Points in a representative region that satisfy the density threshold with respect to the cluster density are selected as new representative points, which are then used to adjust the cluster density. This process repeats until all representative points and representative regions have been found. Representative points whose regions are connected, together with those regions, form a cluster, and points that belong to no cluster are treated as noise. Experimental results show that the method can discover clusters of arbitrary shape with uneven density.
     2) Although CBRD can discover clusters of arbitrary shape, it requires considerable memory and I/O on large data sets, which limits its usefulness for customer segmentation. Building on the idea of CBRD, the thesis therefore proposes an efficient density clustering algorithm based on overlapping data partitions. It retains CBRD's ability to discover arbitrarily shaped clusters of uneven density while also running more efficiently.
     3) The improved density clustering algorithm is applied to telecom customer segmentation, helping enterprises to track market dynamics and providing technical support for identifying potential customers. Experimental results confirm the effectiveness of the algorithm.
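
The representative-point idea in item 1) can be illustrated with a minimal Python sketch. This is not the thesis's implementation, and the abstract leaves several details open, so the following choices are assumptions made only for illustration: point density is taken as the inverse of the mean distance to the k nearest neighbors, a point qualifies as a representative when its density is at least `alpha` times the current cluster density, seeds are tried from densest to sparsest, and groups of k or fewer points are treated as noise. The names `k_nearest`, `cbrd_sketch`, and `alpha` are hypothetical.

```python
import math


def k_nearest(points, i, k):
    """Indices of the k points closest to points[i] (brute force)."""
    others = sorted((math.dist(points[i], points[j]), j)
                    for j in range(len(points)) if j != i)
    return [j for _, j in others[:k]]


def cbrd_sketch(points, k=3, alpha=0.5):
    """Label each point with a cluster id, or -1 for noise.

    Assumptions (not from the thesis): density = 1 / mean k-NN distance,
    a point becomes a representative when its density is at least
    alpha * current cluster density, and groups of k or fewer points are noise.
    """
    n = len(points)
    # Representative region of a point: its k nearest neighbours.
    regions = [k_nearest(points, i, k) for i in range(n)]
    density = [k / (sum(math.dist(points[i], points[j]) for j in regions[i]) + 1e-12)
               for i in range(n)]
    labels = [-1] * n
    next_id = 0
    # Try seeds from densest to sparsest so that dense clusters form first.
    for seed in sorted(range(n), key=lambda i: -density[i]):
        if labels[seed] != -1:
            continue
        labels[seed] = next_id
        members, reps, frontier = [seed], [seed], [seed]
        while frontier:
            rep = frontier.pop()
            for j in regions[rep]:
                if labels[j] != -1:
                    continue
                labels[j] = next_id          # j lies in a representative region
                members.append(j)
                # Cluster density: average density of the representatives so far.
                cluster_density = sum(density[r] for r in reps) / len(reps)
                if density[j] >= alpha * cluster_density:
                    reps.append(j)           # j becomes a representative point
                    frontier.append(j)       # and its own region is explored later
        if len(members) <= k:                # too small to be a cluster: noise
            for j in members:
                labels[j] = -1
        else:
            next_id += 1
    return labels


# Two 3x3 grids with a tenfold difference in spacing, plus one far-away point.
dense = [(x, y) for x in range(3) for y in range(3)]
sparse = [(100 + 10 * x, 100 + 10 * y) for x in range(3) for y in range(3)]
outlier = [(500, -500)]
labels = cbrd_sketch(dense + sparse + outlier, k=3, alpha=0.5)
print(labels)
```

Because the threshold is relative to each cluster's own density rather than a single global value, the tightly spaced grid and the widely spaced grid each end up as one cluster in this toy data set, and the distant point is left as noise.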

