An Improved Over-sampling Ensemble Classification Algorithm for Imbalanced Data (一种改进过采样的不平衡数据集成分类算法)

English title: Over-sampling Based Ensemble Classification Algorithm on Imbalanced Data
Authors: ZHANG Fei-fei (张菲菲); WANG Li-ming (王黎明); CHAI Yu-mei (柴玉梅)
Affiliation: School of Information Engineering, Zhengzhou University (郑州大学信息工程学院)
Keywords: imbalanced data; sub-cluster division; probability distribution; over-sampling; AdaBoost
Journal code: XXWX
English journal name: Journal of Chinese Computer Systems
Publication date: 2018-10-15
Publisher: 小型微型计算机系统 (Journal of Chinese Computer Systems)
Year: 2018
Issue: v.39
Funding: Supported by the National Natural Science Foundation of China (U1636111)
Language: Chinese
Page: XXWX201810006
Page count: 7
CN: 10
ISSN: 21-1106/TP
Classification number: 36-42

Abstract

Imbalanced data classification is an important task in machine learning and data mining. Skewed class distributions and "hard-to-learn" examples within classes cause many conventional classification algorithms to perform poorly. This paper therefore proposes an ensemble classification algorithm for imbalanced data based on improved over-sampling. On the one hand, the algorithm uses majority-class samples to divide the minority-class samples into sub-clusters, takes both between-class and within-class imbalance into account, over-samples with SMOTE according to the probability distribution of the sub-clusters, and corrects the synthetic samples promptly to ensure their quality. On the other hand, it exploits the strengths of AdaBoost on imbalanced data, using decision trees as base classifiers: at the start of each iteration, the over-sampling method synthesizes minority-class samples to balance the training information, and the iterations yield the final classification model. Experiments on seven UCI data sets show that the proposed algorithm significantly improves classification accuracy and, in turn, classifier performance.
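
The second component described in the abstract, over-sampling at the start of each AdaBoost iteration, can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' algorithm: it substitutes plain SMOTE interpolation for the paper's sub-cluster, probability-distribution-based over-sampling with correction, and one-level decision stumps for decision trees. All function names (`smote`, `fit_stump`, `boost_with_oversampling`, `predict`) are hypothetical, and the minority class is assumed to carry label +1 and the majority class label -1.

```python
import math
import random


def smote(minority, n_new, k=3, rng=None):
    """Plain SMOTE: interpolate between a minority sample and one of its
    k nearest minority neighbours (needs at least two minority samples)."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: math.dist(base, p))[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


def stump_predict(stump, x):
    """A stump (feature, threshold, sign) predicts sign if x[feature] > threshold."""
    f, t, sign = stump
    return sign if x[f] > t else -sign


def fit_stump(X, y, w):
    """Exhaustively pick the stump with the lowest weighted error."""
    best_err, best = None, None
    for f in range(len(X[0])):
        for t in sorted({x[f] for x in X}):
            for sign in (1, -1):
                stump = (f, t, sign)
                err = sum(wi for x, yi, wi in zip(X, y, w)
                          if stump_predict(stump, x) != yi)
                if best_err is None or err < best_err:
                    best_err, best = err, stump
    return best


def boost_with_oversampling(X, y, rounds=5, seed=0):
    """AdaBoost loop that adds fresh synthetic minority samples each round."""
    rng = random.Random(seed)
    n = len(X)
    w = [1.0 / n] * n
    minority = [x for x, yi in zip(X, y) if yi == 1]
    n_new = (n - len(minority)) - len(minority)  # samples needed to balance
    model = []
    for _ in range(rounds):
        synth = smote(minority, n_new, rng=rng)
        # Assumption: each synthetic sample gets the uniform weight 1/n.
        stump = fit_stump(X + synth, y + [1] * len(synth),
                          w + [1.0 / n] * len(synth))
        # Error and weight update use the original samples only.
        err = sum(wi for x, yi, wi in zip(X, y, w)
                  if stump_predict(stump, x) != yi)
        err = min(max(err, 1e-10), 1 - 1e-10)  # keep the logarithm finite
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, stump))
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, x))
             for x, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model


def predict(model, x):
    """Weighted vote of all stumps; returns +1 (minority) or -1 (majority)."""
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in model)
    return 1 if score > 0 else -1
```

Calling `boost_with_oversampling(X, y)` on a list of feature tuples `X` and a list of labels `y` returns the weighted stumps, which `predict` combines by weighted vote. Discarding the synthetic samples after each round, so that the error is measured on the original data, is a design choice of this sketch rather than something the abstract specifies.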

