Naive Bayesian Decision Tree Algorithm Combining SMOTE and Filter-Wrapper and Its Application

Original title: 融合SMOTE与Filter-Wrapper的朴素贝叶斯决策树算法及其应用
Authors: XU Zhao-zhao; LI Ching-hwa; CHEN Tong-lin; LEE Shin-jye
Keywords: data balancing; Wrapper feature selection; naive Bayes; decision tree
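Journal: Computer Science
Journal code: JSJA
Affiliations: School of Software, Yunnan University; Key Laboratory in Software Engineering of Yunnan Province
Publication date: 2018-09-15
Publisher: Computer Science
Year: 2018
Volume: 45
Issue: 09
CN: 50-1075/TP
Funding: Supported by the National Natural Science Foundation of China (61379032), project "Dual-model-driven modeling and analysis of dynamic software evolution in cloud computing environments"
Language: Chinese
Record ID: JSJA201809011
Pages: 72-76+81
Page count: 6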

Abstract

Efficiently and accurately mining the medical data generated by Internet-of-Things smart healthcare systems in the context of "Industry 4.0" remains a serious challenge. Medical data are often high-dimensional, imbalanced, and noisy. This paper therefore proposes a new data processing method that combines SMOTE with a Filter-Wrapper feature selection algorithm and applies it to clinical decision support. In particular, the proposed method not only mitigates the poor predictions that naive Bayes produces in practice because of its attribute-independence assumption, but also avoids the overfitting that arises when a C4.5 decision tree model is built. Applied to ECG clinical decision-making, the proposed algorithm achieves good results.
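
The abstract does not give implementation details, so the following is only a minimal sketch of the SMOTE idea it relies on: new minority-class samples are synthesized by interpolating between a minority sample and one of its nearest minority neighbours. The function name `smote`, its parameters, the random choice of base sample, and the toy data are all assumptions made for illustration, not the authors' implementation.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Return n_synthetic points interpolated between minority samples.

    Simplified, hypothetical sketch: the base sample is drawn at random
    rather than iterating over every minority sample in turn.
    """
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than samples
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority class with two features (made-up values).
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote(minority, n_synthetic=6, k=2)
```

Because every synthetic point lies on a segment between two existing minority samples, the oversampled class stays inside the region the minority class already occupies. In the pipeline the abstract describes, balancing of this kind would precede feature selection and classifier training.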
