NIBoost: new imbalanced dataset classification method based on cost sensitive ensemble learning
  • Authors: WANG Li (王莉); CHEN Hongmei (陈红梅); WANG Shengwu (王生武)
  • Affiliation: School of Information Science and Technology, Southwest Jiaotong University
  • Keywords: imbalanced dataset; classification; cost sensitive; over-sampling; Adaboost algorithm
  • Journal: Journal of Computer Applications (计算机应用; journal code JSJY)
  • Publication date: 2019-03-10
  • Year/Issue: 2019, Vol. 39, Issue 03 (No. 343)
  • Fund: National Natural Science Foundation of China (61572406)
  • Language: Chinese
  • Record ID: JSJY201903003
  • Pages: 13-17 (5 pages)
  • CN: 51-1307/TP
Abstract
        A large amount of imbalanced data exists in real life, and most traditional classification algorithms assume a balanced class distribution or equal misclassification costs, so minority-class samples tend to be misclassified when such data are classified. To address this problem, a new imbalanced-data classification algorithm based on cost-sensitive ensemble learning, NIBoost (New Imbalanced Boost), was proposed. Firstly, in each iteration an oversampling algorithm was used to add a certain number of new minority-class samples to balance the dataset, and a classifier was trained on this new dataset. Secondly, this classifier was used to classify the dataset, yielding the predicted class label of each sample and the classification error rate of the classifier. Finally, the weight coefficient of the classifier and the new weight of each sample were calculated from the classification error rate and the predicted class labels. Experiments on UCI datasets used decision trees and Naive Bayes as weak classifiers. The results show that, with a decision tree as the base classifier and compared with the RareBoost algorithm, NIBoost improves F-value by up to 5.91 percentage points, G-mean by up to 7.44 percentage points, and AUC by up to 4.38 percentage points, indicating that the proposed algorithm has an advantage in classifying imbalanced data.
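The abstract describes NIBoost as a boosting loop: oversample the minority class, train a weak classifier on the balanced set, measure its error rate, then update the classifier weight and the sample weights. The sketch below illustrates only that loop. The exact NIBoost weight-update formulas are not given in this record, so a standard AdaBoost-style update is used as a stand-in; likewise, SMOTE (from imbalanced-learn) as the oversampler and a scikit-learn decision tree as the weak learner are assumptions for illustration, not the paper's exact configuration.

    # Minimal sketch of a NIBoost-style loop, based only on the steps in the abstract.
    # AdaBoost-style weight updates and SMOTE/decision-tree choices are assumptions.
    import numpy as np
    from imblearn.over_sampling import SMOTE          # oversampling step (assumed)
    from sklearn.tree import DecisionTreeClassifier   # weak learner (decision tree)

    def niboost_sketch(X, y, n_rounds=10):
        n = len(y)
        w = np.full(n, 1.0 / n)          # sample weights over the original data
        learners, alphas = [], []
        for _ in range(n_rounds):
            # 1) Balance the training set by adding synthetic minority samples.
            #    (How sample weights influence oversampling is not specified in the
            #    abstract; here they are used only in the boosting update.)
            X_bal, y_bal = SMOTE().fit_resample(X, y)
            # 2) Train a weak classifier on the balanced set.
            clf = DecisionTreeClassifier(max_depth=3).fit(X_bal, y_bal)
            # 3) Classify the original samples and compute the weighted error rate.
            pred = clf.predict(X)
            err = np.sum(w * (pred != y)) / np.sum(w)
            if err >= 0.5:               # weak learner no better than chance: stop
                break
            # 4) Classifier weight and new sample weights (AdaBoost-style stand-in).
            alpha = 0.5 * np.log((1.0 - err) / max(err, 1e-10))
            agree = np.where(pred == y, 1.0, -1.0)
            w = w * np.exp(-alpha * agree)   # increase weights of misclassified samples
            w = w / w.sum()
            learners.append(clf)
            alphas.append(alpha)
        return learners, alphas

    def ensemble_predict(learners, alphas, X, classes=(0, 1)):
        # Weighted vote of the weak classifiers over the two classes.
        score = np.zeros(len(X))
        for clf, a in zip(learners, alphas):
            score += a * np.where(clf.predict(X) == classes[1], 1.0, -1.0)
        return np.where(score >= 0, classes[1], classes[0])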
References
[1]WEISS G M,ZADROZNY B,SAAR M.Guest editorial:special issue on utility-based data mining[J].Data Mining and Knowledge Discovery,2008,17(2):129-135.
    [2]del CASTILLO M D,SERRANO J I.A multistrategy approach for digital text categorization from imbalanced documents[J].ACM SIGKDD Explorations Newsletter,2004,6(1):70-79.
    [3]WEI W,LI J,CAO L.Effective detection of sophisticated online banking fraud on extremely imbalanced data[J].World Wide Web,2013,16(4):449-475.
    [4]江颉,王卓芳,GONG R S,等.不平衡数据分类方法及其在入侵检测中的应用研究[J].计算机科学,2013,40(4):131-135.(JIANG J,WANG Z F,GONG R S,et al.Imbalanced data classification method and its application research for intrusion detection[J].Computer Science,2013,40(4):131-135.)
    [5]KUBAT M,HOLTE R C,MATWIN S.Machine learning for the detection of oil spills in satellite radar images[J].Machine Learning,1998,30(2):195-215.
    [6]SCHAEFER G,NAKASHIMA T.Strategies for addressing class imbalance in ensemble classification of thermography breast cancer features[C]//Proceedings of the 2015 IEEE Congress on Evolutionary Computation.Piscataway,NJ:IEEE,2015:2362-2367.
    [7]CHAWLA N V,BOWYER K W,HALL L O.SMOTE:synthetic minority over-sampling technique[J].Journal of Artificial Intelligence Research,2002,16(1):321-357.
    [8]QIAN Y,LIANG Y,LI M.A resampling ensemble algorithm for classification of imbalance problems[J].Neurocomputing,2014,143:57-67.
    [9]DOUZAS G,BACAO F,LAST F.Improving imbalanced learning through a heuristic oversampling method based on k-means and SMOTE[J].Information Sciences,2018,465:1-20.
    [10]GALAR M,FERNANDEZ A,BARRENECHEA E.Ordering-based pruning for improving the performance of ensembles of classifiers in the framework of imbalanced datasets[J].Information Sciences,2016,354:178-196.
    [11]ZHANG Y,ZHANG D,MI G.Using ensemble methods to deal with imbalanced data in predicting protein-protein interactions[J].Computational Biology and Chemistry,2012,36(2):36-41.
    [12]KIM M J,KANG D K,KIM H B.Geometric mean based boosting algorithm with over-sampling to resolve data imbalance problem for bankruptcy prediction[J].Expert Systems with Applications,2015,42(3):1074-1082.
    [13]李雄飞,李军,董元方,等.一种新的不平衡数据学习算法PCBoost[J].计算机学报,2012,35(2):202-209.(LI X F,LI J,DONG Y F,et al.A new learning algorithm for imbalanced data PCBoost[J].Chinese Journal of Computers,2012,35(2):202-209.)
    [14]HE H,ZHANG W,ZHANG S.A novel ensemble method for credit scoring:adaption of different imbalance ratios[J].Expert Systems with Applications,2018,98:105-117.
    [15]付忠良.多标签代价敏感分类集成学习算法[J].自动化学报,2014,40(6):1075-1085.(FU Z L.Cost-sensitive ensemble learning algorithm for multi-label classification problems[J].Acta Automatica Sinica,2014,40(6):1075-1085.)
    [16]FAN W,STOLFO S J,ZHANG J,et al.AdaCost:misclassification cost-sensitive boosting[C]//ICML'99:Proceedings of the 16th International Conference on Machine Learning.San Francisco,CA:Morgan Kaufmann Publishers,1999:97-105.
    [17]SUN Y,KAMEL M S,WONG A K C.Cost-sensitive boosting for classification of imbalanced data[J].Pattern Recognition,2007,40(12):3358-3378.
    [18]JOSHI M V,KUMAR V,AGARWAL R.Evaluating boosting algorithms to classify rare classes:comparison and improvements[C]//Proceedings of the 2001 IEEE International Conference on Data Mining.Piscataway,NJ:IEEE,2001:257-264.
    [19]SIERS M J,ISLAM M Z.Software defect prediction using a cost sensitive decision forest and voting,and a potential solution to the class imbalance problem[J].Information Systems,2015,51:62-71.
    [20]ZHANG Y,WANG D.A cost-sensitive ensemble method for class-imbalanced data sets[J].Abstract and Applied Analysis,2013,2013:Article ID 196256.
    [21]AODHA O M,BROSTOW G J.Revisiting example dependent cost-sensitive learning with decision trees[C]//Proceedings of the 2013 IEEE International Conference on Computer Vision.Piscataway,NJ:IEEE,2013:193-200.
    [22]GALAR M,FERNANDEZ A,BARRENECHEA E.EUSBoost:enhancing ensembles for highly imbalanced data-sets by evolutionary undersampling[J].Pattern Recognition,2013,46(12):3460-3471.
    [23]ROY N K S,ROSSI B.Cost-sensitive strategies for data imbalance in bug severity classification:experimental results[C]//Proceedings of the 2017 43rd Euromicro Conference on Software Engineering and Advanced Applications.Washington,DC:IEEE Computer Society,2017:426-429.
    [24]LEE H K,KIM S B.An overlap-sensitive margin classifier for imbalanced and overlapping data[J].Expert Systems with Applications,2018,98:72-83.
