Research on Naive Bayes Classification Algorithms Based on Rough Sets
Abstract
Data mining is a natural result of the evolution of information technology: a complex process of extracting, or "mining", hidden and potentially valuable knowledge from large volumes of data. Within data mining, classification is an important research topic. Bayesian classification is an inference method with a solid mathematical foundation and the ability to combine prior information with the evidence in the data. Its simplest form, the naive Bayes classification model, has been widely studied and applied because it is simple and efficient. This thesis analyzes the classification principle of the naive Bayes algorithm together with its strengths and weaknesses, and then studies the model in depth from two directions: first, using attribute selection to weaken the limitation imposed by its conditional independence assumption, and second, building on that result, combining the model with ensemble learning. The main research work is summarized below.
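For orientation (the notation is standard, not taken from the original abstract): given attribute values a_1, ..., a_n and a set of classes C, the naive Bayes model predicts

    c^{*} = \arg\max_{c \in C} P(c) \prod_{i=1}^{n} P(a_i \mid c),

where the product form is exactly the conditional independence assumption P(a_1, \dots, a_n \mid c) = \prod_{i=1}^{n} P(a_i \mid c). Real data rarely satisfies this factorization, which is why the attribute-selection work below looks for a subset of attributes on which it approximately holds.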
1. After analyzing two shortcomings of the CEBARKNC attribute reduction algorithm proposed by Wang Guoyin et al., this thesis proposes ASBCE, an improved attribute reduction algorithm based on conditional entropy. The algorithm introduces the cosine measure from association rule mining to identify inconsistent instances, and it removes redundant attributes based on the observation that a strongly relevant attribute tends, to some degree, to be strongly correlated with the other attributes as well. Experiments show that the algorithm produces an attribute subset that is as close to mutually independent as possible, thereby relaxing the conditional independence assumption of naive Bayes.
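The abstract gives no pseudocode for ASBCE, so the following Python sketch only illustrates the two ingredients it names: conditional entropy H(class | attribute) as a relevance score, and a cosine measure as a redundancy filter. The function names, the numeric encoding of attribute columns, and the 0.95 threshold are illustrative assumptions rather than the thesis's definitions.

    import math
    from collections import Counter, defaultdict

    def conditional_entropy(attr_col, labels):
        # H(class | attribute): lower values mean the attribute is more relevant.
        n = len(labels)
        by_value = defaultdict(list)
        for a, c in zip(attr_col, labels):
            by_value[a].append(c)
        h = 0.0
        for group in by_value.values():
            p_a = len(group) / n
            for count in Counter(group).values():
                p = count / len(group)
                h -= p_a * p * math.log2(p)
        return h

    def cosine(u, v):
        # Cosine similarity between two numerically encoded attribute columns.
        dot = sum(x * y for x, y in zip(u, v))
        norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
        return dot / norm if norm else 0.0

    def select_attributes(X, y, redundancy_threshold=0.95):
        # Greedy max-relevance / min-redundancy filter: visit attributes in
        # order of ascending H(y | attribute) and skip any attribute whose
        # column is too similar (by cosine) to one already selected.
        cols = sorted(range(len(X[0])),
                      key=lambda j: conditional_entropy([row[j] for row in X], y))
        selected = []
        for j in cols:
            col_j = [row[j] for row in X]
            if all(cosine(col_j, [row[k] for row in X]) < redundancy_threshold
                   for k in selected):
                selected.append(j)
        return selected

Cosine similarity over raw category codes is only a crude proxy for attribute correlation; the thesis's actual redundancy test may differ.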
2. The naive Bayes classification model rests on Bayes' theorem and the conditional independence assumption, which gives it a simple structure and high computational efficiency. Real-world data, however, generally fails to satisfy the conditional independence assumption, and this is the model's main limitation. To overcome it and improve classification performance, selecting an attribute subset that is as close to mutually independent as possible is an effective remedy. The focus of this thesis is to use attribute selection to find a maximum-relevance, minimum-redundancy attribute subset; accordingly, building on the ASBCE attribute reduction algorithm, the thesis proposes RSSNBC, a rough-set-based selective naive Bayes classification model. Experimental results show that RSSNBC achieves better classification accuracy than the classical naive Bayes model.
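A minimal sketch of the select-then-train pipeline that RSSNBC embodies, under the same assumptions as above; scikit-learn's CategoricalNB stands in for the thesis's naive Bayes learner, and select_attributes is the illustrative filter sketched under item 1, not the thesis's algorithm.

    import numpy as np
    from sklearn.naive_bayes import CategoricalNB

    # Toy discretized data: six instances, three categorical attributes, two classes.
    X = np.array([[0, 1, 0], [0, 1, 1], [1, 0, 0], [1, 0, 1], [0, 0, 0], [1, 1, 1]])
    y = np.array([0, 0, 1, 1, 0, 1])

    selected = select_attributes(X, y)              # ASBCE-style reduction step
    model = CategoricalNB().fit(X[:, selected], y)  # naive Bayes on the reduced subset
    print(selected, model.predict(X[:, selected]))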
3. To further improve on the performance of the single classifiers above, classifier ensemble learning is introduced: multiple classifiers are combined by some method into a single combined classifier. The naive Bayes model is a simple and efficient probabilistic classification method, and simple, accurate classifiers are well suited as base classifiers for ensemble learning. Because naive Bayes is a stable model, bootstrap resampling alone yields little diversity, so feature selection is embedded into the Bagging ensemble algorithm to increase the diversity among the individual classifiers and improve their generalization ability. Building on the ASBCE attribute reduction algorithm, the thesis proposes SNBCE, a selective naive Bayes ensemble classification algorithm. Experiments show that, by combining ensemble learning with feature selection, the algorithm improves classification performance more effectively.
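Again as an illustrative sketch rather than the thesis's algorithm: the SNBCE idea as described pairs Bagging's bootstrap resampling with per-member attribute selection, so each naive Bayes member sees both different instances and different attributes. The member count, the majority-vote combination, and the reuse of select_attributes are assumptions.

    import numpy as np
    from sklearn.naive_bayes import CategoricalNB

    def bagged_selective_nb(X, y, n_members=10, seed=0):
        # Bagging with embedded attribute selection: resampling alone gives a
        # stable learner like naive Bayes little diversity, so every member
        # also gets its own attribute subset.
        rng = np.random.default_rng(seed)
        n_cat = int(X.max()) + 1  # guard against categories missing from a bootstrap sample
        members = []
        for _ in range(n_members):
            idx = rng.integers(0, len(y), size=len(y))   # bootstrap sample
            feats = select_attributes(X[idx], y[idx])    # per-member subset
            clf = CategoricalNB(min_categories=n_cat).fit(X[idx][:, feats], y[idx])
            members.append((clf, feats))
        return members

    def predict_majority(members, X):
        # Majority vote over the members' predictions (labels assumed 0..k-1).
        votes = np.stack([clf.predict(X[:, feats]) for clf, feats in members])
        return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)

Usage on the toy data above: members = bagged_selective_nb(X, y); y_hat = predict_majority(members, X). On real data, the gain over a single selective naive Bayes comes from the diversity the per-member subsets induce.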
References
[1] Ian H. Witten, Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques (Chinese edition). Beijing: China Machine Press, 2006, 98-189.
[2] Jiawei Han, Micheline Kamber. Data Mining: Concepts and Techniques (Chinese edition). Beijing: China Machine Press, 2001, 1-12.
[3] Yao Zheng'an. Application of Probability and Statistics in Classifier Design and Research on Improving the Naive Bayes Classifier [Master's thesis]. Guangzhou: Sun Yat-sen University, 2010.
[4] C. Ratanamahatana, D. Gunopulos. Feature Selection for the Naive Bayesian Classifier Using Decision Trees. Applied Artificial Intelligence, 2003, 17(5-6): 475-487.
[5] Z. Pawlak. Rough Sets. London: Kluwer Academic Publishers, 1991, 10-60.
[6] Xie Zheng. Research on Classification Problems Based on Bayesian Methods [Master's thesis]. Changsha: Central South University, 2008.
[7] Duan Jing. Naive Bayes Classification and Its Applications [Master's thesis]. Dalian: Dalian Maritime University, 2011.
[8] Wang Guocai. Research and Application of the Naive Bayes Classifier [Master's thesis]. Chongqing: Chongqing Jiaotong University, 2010.
[9] Guo Yusong. A Heuristic Bayesian Classification Algorithm and Its Application to Railway Freight Customer Segmentation [Master's thesis]. Beijing: Beijing Jiaotong University, 2008.
[10] Sun Yuanze. The Naive Bayes Algorithm and Its Application to Telecom Customer Churn Analysis [Master's thesis]. Changsha: Hunan University, 2010.
[11] I. Kononenko. Semi-naive Bayesian Classifier. In: Proceedings of the European Working Session on Learning, Porto, Portugal. Springer-Verlag, 1991, 206-219.
[12] Wang Jun. Research and Application of the Naive Bayes Classification Model [Master's thesis]. Hefei: Hefei University of Technology, 2006.
[13] N. Friedman, D. Geiger, M. Goldszmidt. Bayesian Network Classifiers. Machine Learning, 1997, 29(2-3): 131-163.
[14] Peng Haowei. A Study of Selective Weighted Naive Bayes Classification Methods [Master's thesis]. Guangzhou: Sun Yat-sen University, 2010.
[15] Yang Yuying. The Naive Bayes Classifier and Several Methods for Improving Its Classification Performance [Master's thesis]. Guangzhou: Sun Yat-sen University, 2009.
[16] J. Cheng, R. Greiner. Comparing Bayesian Network Classifiers. In: Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence. San Francisco: Morgan Kaufmann, 1999, 101-108.
[17] P. Langley, S. Sage. Induction of Selective Bayesian Classifiers. In: Proceedings of the Tenth Conference on Uncertainty in Artificial Intelligence, 1994, 399-406.
[18] E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Seattle: Morgan Kaufmann Publishers, 2000, 265-314.
[19] Deng Weibin, Huang Shujiang, Zhou Yumin. An Autonomous Naive Bayes Classification Method Based on Conditional Information Entropy. Journal of Computer Applications, 2007, 27(4): 156-160.
[20] Wu Mingwang. Research on Rough-Set-Based Attribute Reduction Algorithms for Data Mining [Master's thesis]. Chengdu: University of Electronic Science and Technology of China, 2006.
[21] Zhang Qingsheng. Attribute Selection Algorithms and Optimization Methods in Rough Sets [Master's thesis]. Baoding: Hebei University, 2009.
[22] Deng Yijian. Research and Application of Incremental Rough Set and Incremental Bayesian Classifier Algorithms [Master's thesis]. Changsha: Hunan University, 2009.
[23] Li Lan. Information-Entropy-Based Attribute Reduction and Its Applications [Master's thesis]. Dalian: Dalian Maritime University, 2008.
[24] A. Skowron. Rough Sets in KDD. Special invited speech, WCC2000. Beijing, 2000, 123-145.
[25] H. Rasiowa, A. Skowron. Approximation Logic. In: Proceedings of the Mathematical Methods of Specification and Synthesis of Systems Conference. Berlin, 1985, 123-139.
[26] Wang Guoyin, Yu Hong, Yang Dachun. Decision Table Reduction Based on Conditional Information Entropy. Chinese Journal of Computers, 2002, 25(7): 759-766.
[27] Li Ming, et al. An Improvement of the CEBARKNC Algorithm for Decision Table Reduction. Journal of Computer Applications, 2006, 26(4): 864-866.
[28] Chen Jingnian. Research on Selective Bayesian Classification Algorithms [Doctoral dissertation]. Beijing: Beijing Jiaotong University, 2008.
[29] Qi Min, Li Dajian, Hao Chongyang. Introduction to Pattern Recognition. Beijing: Tsinghua University Press, 2009, 69-75.
[30] Zhang Jing, Wang Jianmin, He Huacan. A New Attribute Reduction Method Based on Attribute Relevance. Computer Engineering and Applications, 2005, 28(13): 55-57.
[31] Wang Guangtao, Song Qinbao, Che Rui. A New Attribute Selection Algorithm Based on Information Entropy. Journal of Computer Research and Development, 2009, 509-514.
[32] Deng Weibin. An Autonomous Naive Bayes Learning Algorithm Based on Rough Set Theory [Master's thesis]. Chongqing: Chongqing University of Posts and Telecommunications, 2007.
[33] C. L. Blake, C. J. Merz. UCI Repository of Machine Learning Databases. http://www.ics.uci.edu/~mlearn/MLRepository.html
[34] Jiao Licheng, Liu Fang, Gou Shuiping, et al. Intelligent Data Mining and Knowledge Discovery. Xi'an: Xidian University Press, 2006, 431-471.
[35] E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Seattle: Morgan Kaufmann Publishers, 2000, 265-314.
[36] E. Bauer, R. Kohavi. An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting, and Variants. Machine Learning, 1999, 36(1-2): 105-139.
[37] Liu Tianyu. Research on Feature-Selection-Based Ensemble Learning Methods and Their Applications [Master's thesis]. Shanghai: Shanghai University, 2007.
[38] Zhang Lixin. Feature Selection for High-Dimensional Data and Feature-Selection-Based Ensemble Learning [Master's thesis]. Beijing: Tsinghua University, 2004.
[39] L. Breiman. Bagging Predictors. Machine Learning, 1996, 24(2): 123-140.
[40] B. Efron, R. Tibshirani. An Introduction to the Bootstrap. New York: Chapman & Hall, 1993.
[41] Wang Fei. Ensemble Learning and Its Application to Gene Data Analysis [Master's thesis]. Nanjing: Nanjing Normal University, 2010.
[42] R. E. Schapire. The Strength of Weak Learnability. Machine Learning, 1990, 5(2): 197-227.
[43] Y. Freund. Boosting a Weak Learning Algorithm by Majority. Information and Computation, 1995, 121(2): 256-285.
[44] T. G. Dietterich. Ensemble Methods in Machine Learning. Lecture Notes in Computer Science, 2000, 234-245.
[45] Jiang Baining. Research on Feature Selection Algorithms in Machine Learning [Master's thesis]. Qingdao: Ocean University of China, 2009.
[46] R. E. Banfield, L. O. Hall. Ensemble Diversity Measures and Their Application to Thinning. Information Fusion, 2005, 6(1): 49-62.
[47] H. Liu, F. Hussain. Discretization: An Enabling Technique. Data Mining and Knowledge Discovery, 2002, 6(4): 393-423.
[48] T. K. Ho. The Random Subspace Method for Constructing Decision Forests. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1998, 20(8): 832-844.
[49] R. Bryll, R. Gutierrez-Osuna, F. Quek. Attribute Bagging: Improving Accuracy of Classifier Ensembles by Using Random Feature Subsets. Pattern Recognition, 2003, 36(6): 1291-1302.
[50] G. I. Webb. MultiBoosting: A Technique for Combining Boosting and Wagging. Machine Learning, 2000, 40(2): 159-196.
[51] S. B. Kotsiantis, D. Kanellopoulos. Combining Bagging, Boosting and Dagging for Classification Problems. In: Knowledge-Based Intelligent Information and Engineering Systems (KES 2007, LNAI 4693). Vietri sul Mare, Italy: Springer, 2007, 493-500.
[52] V. Guruswami, A. Sahai. Multiclass Learning, Boosting, and Error-Correcting Codes. In: Proceedings of the Twelfth Annual Conference on Computational Learning Theory (COLT 1999). Santa Cruz, CA: ACM, 1999, 145-155.
[53] Qiu Yuxiang. Feature Selection and Ensemble Learning and Their Application to Intrusion Detection [Master's thesis]. Nanjing: Nanjing Normal University, 2008.
[54] Jian Zhiping. Ensemble-Learning-Based Feature Selection and Stability Analysis [Master's thesis]. Guangzhou: Sun Yat-sen University, 2010.
[55] Chen Guanghua. Research on Decision Tree Design and Ensemble Techniques [Master's thesis]. Yangzhou: Yangzhou University, 2010.
[56] A. Tsymbal, S. Puuronen, D. W. Patterson. Ensemble Feature Selection with the Simple Bayesian Classification. Information Fusion, 2003, 4(2): 87-100.
[57] L. I. Kuncheva, J. J. Rodriguez. Classifier Ensembles with a Random Linear Oracle. IEEE Transactions on Knowledge and Data Engineering, 2007, 19(4): 500-508.
[58] He Lifeng. Research on Ensemble Learning Methods for the Naive Bayes Classifier [Master's thesis]. Baoding: Hebei University, 2009.
