Research on the Naive Bayes Algorithm and Its Application in Telecom Customer Churn Analysis
Abstract

With intensifying competition in the domestic and international telecommunications markets, customer churn has become one of the problems operators are most concerned about. Faced with increasingly serious churn, telecom enterprises need data mining techniques to analyze the characteristics of churning customers, so that measures can be taken to retain valuable customers and thereby reduce churn and the resulting economic losses. Customer churn prediction has therefore become an important problem for the telecommunications industry.

This thesis focuses on the Naive Bayes classification algorithm in data mining and applies it to customer churn analysis in the telecom industry. The main contributions are as follows:

(1) To address the drop in Naive Bayes classification performance caused by redundant attributes, an improved selective Naive Bayes algorithm is proposed. The algorithm first ranks the attributes by their information gain and then selects a subset of them, which improves classification accuracy (a sketch of this idea follows the abstract).

(2) To address the churn prediction problem created by churned customers of different grades and numbers, and the different losses they bring to the operator, a Naive Bayes algorithm based on maximum value is proposed. The algorithm introduces a notion of customer value and adjusts the value coefficient of value-sensitive attributes so that the total value contained in the predicted churn list is maximized. Simulation results show that, while maintaining comparable accuracy, the algorithm identifies more high-value churning customers (see the second sketch below).

(3) Taking the two algorithms above as the foundation and the data mining process as the thread, a telecom customer churn prediction model is constructed. The model selects attributes with the improved selective Naive Bayes algorithm and then performs classification and prediction with the maximum-value Naive Bayes algorithm. Simulation results show that the model achieves good classification and prediction performance (a pipeline sketch closes this section).
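The thesis text itself contains no source code; the following Python sketch only illustrates contribution (1) under stated assumptions: attributes are integer-encoded categorical features, scikit-learn's mutual_info_classif stands in for the information-gain measure, CategoricalNB plays the role of the Naive Bayes classifier, and the greedy "keep an attribute only if cross-validated accuracy does not drop" rule is an illustrative choice rather than the thesis's exact selection criterion.

```python
# Sketch of an information-gain-ranked selective Naive Bayes (contribution 1).
# Assumptions: X holds non-negative integer-encoded categorical attributes;
# the greedy acceptance rule is illustrative, not the thesis's exact criterion.
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.naive_bayes import CategoricalNB
from sklearn.model_selection import cross_val_score


def selective_naive_bayes(X, y, cv=5):
    # 1. Rank attributes by information gain (mutual information with the class).
    gains = mutual_info_classif(X, y, discrete_features=True, random_state=0)
    order = np.argsort(gains)[::-1]

    # 2. Add attributes in that order, keeping each one only if it does not
    #    lower cross-validated accuracy.
    selected, best_acc = [], 0.0
    for idx in order:
        candidate = selected + [int(idx)]
        acc = cross_val_score(CategoricalNB(), X[:, candidate], y, cv=cv).mean()
        if acc >= best_acc:
            selected, best_acc = candidate, acc

    model = CategoricalNB().fit(X[:, selected], y)
    return model, selected
```

Because the attributes are ranked once up front, the selection pass costs only one cross-validation run per attribute instead of a full wrapper search over attribute subsets.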
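Contribution (2) is described only at the level of ideas (a value concept, a value coefficient on value-sensitive attributes, and a churn list of maximum total value), so the sketch below is one plausible reading rather than the thesis's formulation: a hypothetical per-customer monetary value (for example, monthly revenue) scales the Naive Bayes churn score, and the scaling coefficient is chosen to maximize the total value of correctly flagged churners while keeping plain accuracy above a floor.

```python
# Illustrative "maximum value" Naive Bayes decision rule (contribution 2).
# Hypothetical setup: `value` is a per-customer monetary value; the churn
# probability is scaled by (1 + k * normalized value), and k is picked to
# maximize the value of the correctly flagged churners while accuracy stays
# above a floor.  This is one plausible reading, not the thesis's formulation.
import numpy as np
from sklearn.naive_bayes import CategoricalNB
from sklearn.metrics import accuracy_score


def fit_max_value_nb(X, y, value, acc_floor=0.85, k_grid=np.linspace(0, 2, 21)):
    nb = CategoricalNB().fit(X, y)
    p_churn = nb.predict_proba(X)[:, 1]   # churn class assumed to be label 1
    v_norm = value / value.max()          # scale customer value into [0, 1]

    best_k, best_value = 0.0, -1.0
    for k in k_grid:
        score = np.clip(p_churn * (1 + k * v_norm), 0, 1)
        pred = (score >= 0.5).astype(int)
        if accuracy_score(y, pred) < acc_floor:
            continue                      # keep at least the required accuracy
        captured = value[(pred == 1) & (y == 1)].sum()  # value of churners caught
        if captured > best_value:
            best_k, best_value = k, captured
    return nb, best_k
```

In a real deployment the coefficient k and the accuracy floor would be tuned on a held-out validation set rather than on the training data used here.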

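Contribution (3) chains the two pieces into a churn prediction model: attribute selection first, then value-aware classification. A minimal sketch reusing the hypothetical helpers defined in the two sketches above (selective_naive_bayes and fit_max_value_nb):

```python
# Hypothetical end-to-end churn prediction pipeline combining the two sketches
# above: information-gain-based attribute selection, then value-weighted
# Naive Bayes scoring of new customers.
def churn_prediction_pipeline(X_train, y_train, value_train, X_new, value_new):
    # Step 1: pick attributes with the selective Naive Bayes sketch.
    _, selected = selective_naive_bayes(X_train, y_train)

    # Step 2: fit the value-aware classifier on the selected attributes.
    nb, k = fit_max_value_nb(X_train[:, selected], y_train, value_train)

    # Step 3: score new customers (assumed to use the category encodings seen
    # in training); normalize value by the training maximum, matching the
    # fitting sketch, and flag those whose weighted score clears 0.5.
    p_churn = nb.predict_proba(X_new[:, selected])[:, 1]
    score = p_churn * (1 + k * value_new / value_train.max())
    return score >= 0.5
```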