Abstract
Naive Bayes is widely applied because it is simple, computationally efficient, and accurate, and because it rests on a solid theoretical foundation. Since diversity is a key condition for ensemble learning, this paper proposes a method based on the k-means++ clustering technique for increasing the diversity of a naive Bayes classifier ensemble, thereby improving the generalization performance of naive Bayes. First, multiple naive Bayes base classifier models are trained on the training set. Then, to increase the diversity among the base classifiers, the k-means++ algorithm is used to cluster the base classifiers' prediction results on a validation set. Finally, the base classifier with the best generalization performance is selected from each cluster for ensemble learning, and the final result is obtained by simple majority voting. The method is validated on UCI benchmark datasets, and the results show that its generalization performance is substantially improved.
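The pipeline described in the abstract can be sketched as follows. This is a minimal illustrative implementation, not the authors' code: it assumes bootstrap resampling to produce the base classifiers, uses scikit-learn's `GaussianNB` and `KMeans` (whose default initialisation is k-means++), and treats validation accuracy as the "generalization performance" criterion for per-cluster selection. All variable names and parameter values (e.g. 10 base classifiers, 3 clusters) are assumptions for the sketch.

```python
# Sketch of the described ensemble: train naive Bayes base classifiers,
# cluster their validation-set predictions with k-means++, keep the most
# accurate classifier per cluster, and combine the survivors by majority vote.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X, y = load_iris(return_X_y=True)
X_tr, X_tmp, y_tr, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_te, y_val, y_te = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# 1. Train base classifiers on bootstrap resamples of the training set
#    (resampling is one assumed way to obtain distinct base models).
n_base, n_keep = 10, 3
bases = []
for _ in range(n_base):
    idx = rng.integers(0, len(X_tr), len(X_tr))
    bases.append(GaussianNB().fit(X_tr[idx], y_tr[idx]))

# 2. Cluster the classifiers by their prediction vectors on the validation
#    set; init="k-means++" is scikit-learn's careful-seeding initialisation.
val_preds = np.array([clf.predict(X_val) for clf in bases])  # (n_base, n_val)
labels = KMeans(n_clusters=n_keep, init="k-means++", n_init=10,
                random_state=0).fit_predict(val_preds)

# 3. From each cluster, keep the classifier with the best validation accuracy.
selected = []
for c in range(n_keep):
    members = [i for i in range(n_base) if labels[i] == c]
    best = max(members, key=lambda i: (val_preds[i] == y_val).mean())
    selected.append(bases[best])

# 4. Simple (majority) voting over the selected classifiers on the test set.
votes = np.array([clf.predict(X_te) for clf in selected])
final = np.array([np.bincount(col).argmax() for col in votes.T])
print("ensemble test accuracy:", (final == y_te).mean())
```

Clustering the prediction vectors (rather than the models' parameters) groups classifiers that behave similarly, so selecting one representative per cluster yields an ensemble whose members disagree more often, which is exactly the diversity condition the abstract highlights.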