KFDA-Boosting Algorithm Oriented to Imbalanced Data Classification

Original title: 面向不平衡数据分类的KFDA-Boosting算法
Authors: Wang Lai (王来); Fan Chongjun (樊重俊); Yang Yunpeng (杨云鹏); Yuan Guanghui (袁光辉)
Affiliations: Business School, University of Shanghai for Science & Technology; School of Information Management & Engineering, Shanghai University of Finance & Economics; Experimental Center, Shanghai University of Finance & Economics
Keywords: kernel Fisher discriminant analysis; ensemble learning; imbalanced data; classification
Journal code: JSYJ
Journal: Application Research of Computers (计算机应用研究)
Publication date: 2018-02-09 12:30
Year: 2019
Volume/issue: v.36; No.329
Funding: National Natural Science Foundation of China (71303157); Shanghai Municipal Education Commission Key Scientific Research Innovation Fund Project (14ZZ131); Shanghai First-Class Discipline Funding Project (S1205YLXK); Shanghai Social Science Planning Youth Project (2014EGL007); Hujiang Foundation Project (D14008)
Language: Chinese
Record ID: JSYJ201903034
Page count: 5
Issue: 03
CN: 51-1196/TP
Pages: 174-178

Abstract

Imbalanced class distributions and nonlinear data features make classification harder. In particular, minority-class samples in imbalanced data are difficult to identify, which degrades overall classification performance. To address this problem, this paper proposes the KFDA-Boosting algorithm, which combines the ability of kernel Fisher discriminant analysis (KFDA) to extract nonlinear features from samples with the idea of the Boosting algorithm from ensemble learning. To verify the effectiveness and superiority of the algorithm on imbalanced data, the G-mean value and the precision and recall of the minority class are used as evaluation metrics. Ten UCI datasets are used to test KFDA-Boosting, and it is compared experimentally with six other classification algorithms, including the support vector machine. The results show that, for imbalanced data classification, and especially for data with a high degree of imbalance or nonlinear characteristics, KFDA-Boosting identifies the minority class more effectively than the other algorithms and achieves marked overall classification performance with good stability.
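
The three evaluation metrics named in the abstract can be illustrated with a small Python sketch. The record does not give the paper's exact formulas, so this assumes the commonly used definitions: precision and recall computed with the minority class as the positive class, and G-mean as the geometric mean of minority-class recall and majority-class recall (specificity). The function name `minority_metrics` and the toy labels are hypothetical and are not taken from the paper.

```python
import math

def minority_metrics(y_true, y_pred, minority=1):
    """Return (precision, recall, g_mean), treating `minority` as the positive class.

    Assumed definitions (not quoted from the paper):
      precision = TP / (TP + FP)
      recall    = TP / (TP + FN)
      G-mean    = sqrt(recall * specificity), with specificity = TN / (TN + FP)
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == minority and p == minority)
    fn = sum(1 for t, p in pairs if t == minority and p != minority)
    fp = sum(1 for t, p in pairs if t != minority and p == minority)
    tn = sum(1 for t, p in pairs if t != minority and p != minority)

    # Guard against empty denominators by reporting 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    g_mean = math.sqrt(recall * specificity)
    return precision, recall, g_mean

# Toy example: 2 minority samples (label 1) among 10.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
precision, recall, g_mean = minority_metrics(y_true, y_pred)

# A classifier that always predicts the majority class is 80% accurate
# on these labels, yet its G-mean is 0 because minority recall is 0.
always_majority = [0] * len(y_true)
_, _, g_mean_trivial = minority_metrics(y_true, always_majority)
```

The second call shows why accuracy alone is a weak metric for imbalanced data: G-mean collapses to zero as soon as one class is never recognized.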

