Abstract
To address the class imbalance found in credit scoring in the credit industry, this paper first determines the main evaluation indicators by analyzing the Fisher ratio of each factor that influences credit scoring. It then builds a class-imbalance credit scoring prediction model, Fisher-SDSMOTE-ESBoostSVM, which uses the support-degree-based oversampling algorithm (SDSMOTE) to synthesize samples, a support vector machine (SVM) as the base predictor, and Boosting as the framework. After the base classifier is trained, an elimination strategy deletes the synthetic samples that were not classified correctly, regenerates positive-class samples, and corrects the sample weights. Finally, using the German credit dataset from the UCI repository as the experimental data and the F-measure and G-mean as evaluation metrics, the paper compares the predictions of Fisher-SDSMOTE-ESBoostSVM with those of other ensemble learning algorithms. The experimental results show that applying Fisher-SDSMOTE-ESBoostSVM to customer credit score prediction in the credit industry is feasible and adaptable, achieves relatively high prediction accuracy, and has some practical application value.
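As an illustration of the quantities named in the abstract, the following minimal sketch computes a two-class Fisher ratio for a single feature, together with the F-measure and G-mean of a set of predictions. The abstract does not give the formulas, so the sketch assumes the common textbook definitions: Fisher ratio = (difference of class means)^2 / (sum of class variances), F-measure = (1 + beta^2) * precision * recall / (beta^2 * precision + recall), and G-mean = sqrt(TPR * TNR). All function names, the use of population variance, and the toy data are hypothetical and are not taken from the paper.

```python
import math
from statistics import fmean, pvariance


def fisher_ratio(pos_values, neg_values):
    """Two-class Fisher ratio of one feature (assumed textbook form).

    Uses population variance; the paper may define the ratio differently.
    """
    between = (fmean(pos_values) - fmean(neg_values)) ** 2
    within = pvariance(pos_values) + pvariance(neg_values)
    if within == 0:
        # Both classes are constant: perfectly separable or identical.
        return float("inf") if between > 0 else 0.0
    return between / within


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn), treating `positive` as the minority class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        elif pred == positive:
            fp += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def f_measure(tp, fp, fn, beta=1.0):
    """F-measure of the positive class; beta=1 gives the usual F1 score."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def g_mean(tp, fp, tn, fn):
    """Geometric mean of the true positive rate and the true negative rate."""
    tpr = tp / (tp + fn) if tp + fn else 0.0
    tnr = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(tpr * tnr)


# Toy data (hypothetical): 3 positive and 7 negative samples.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 1, 1]
tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print("F-measure:", f_measure(tp, fp, fn))
print("G-mean:", g_mean(tp, fp, tn, fn))
print("Fisher ratio:", fisher_ratio([1, 2, 3], [4, 5, 6]))
```

Both metrics depend on how well the minority (positive) class is recognized, which is why they are commonly preferred to plain accuracy on imbalanced data: a classifier that always predicts the majority class scores zero on both.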