Research on an Information-Gain-Based SMOTE-Improved SVM Model for Imbalanced Data
Abstract
Traditional support vector machines (SVMs) classify poorly when the training data are imbalanced. To address this, the paper proposes an imbalanced SVM classification algorithm based on oversampling of key indicators. Key indicators of the samples are first identified using information gain theory. An interval-number-based method then extends these key indicators, and the extended samples are oversampled by hypercube vertex sampling so that the number of minority-class samples is balanced. Finally, an SVM classification model is built and the interval-valued indicators are optimized to determine the final classification result. Experimental results show that the proposed algorithm improves classification performance relative to other imbalanced-data algorithms, with a more pronounced advantage when the samples have many indicators.
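The first two stages described in the abstract (ranking indicators by information gain, then oversampling at hypercube vertices) can be illustrated with a minimal sketch. The abstract does not give the paper's formulas, so everything here is an assumption made for illustration: the equal-width binning used to discretize features, the fixed interval half-width `radius`, and the function names `information_gain`, `key_features`, and `hypercube_vertices`. The SVM training and the optimization of the interval-valued indicators are omitted.

```python
import math
from collections import Counter
from itertools import product


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(column, labels, bins=2):
    """Information gain of one numeric feature after equal-width binning.

    The binning scheme is an assumption; the abstract does not say how
    continuous indicators are discretized.
    """
    lo, hi = min(column), max(column)
    width = (hi - lo) / bins or 1.0  # guard against a constant column
    groups = {}
    for value, label in zip(column, labels):
        idx = min(int((value - lo) / width), bins - 1)
        groups.setdefault(idx, []).append(label)
    conditional = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def key_features(X, y, k):
    """Indices of the k features with the highest information gain."""
    gains = [information_gain([row[j] for row in X], y) for j in range(len(X[0]))]
    return sorted(range(len(gains)), key=lambda j: gains[j], reverse=True)[:k]


def hypercube_vertices(sample, keys, radius):
    """Widen each key feature to [v - radius, v + radius] and return
    the 2**len(keys) vertices of that box as synthetic samples.

    `radius` is a hypothetical fixed half-width, not a value from the paper.
    """
    synthetic = []
    for signs in product((-1.0, 1.0), repeat=len(keys)):
        point = list(sample)
        for j, sign in zip(keys, signs):
            point[j] = sample[j] + sign * radius
        synthetic.append(point)
    return synthetic


# Toy data: feature 0 separates the classes, feature 1 does not.
X = [[1.0, 5.0], [2.0, 1.0], [8.0, 5.0], [9.0, 1.0]]
y = [0, 0, 1, 1]
keys = key_features(X, y, k=1)
new_minority = hypercube_vertices(X[2], keys, radius=0.5)
```

Because each minority sample yields `2**k` vertices, the number of key indicators `k` directly controls how strongly the minority class is oversampled; in practice a subset of vertices could be drawn when `k` is large.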
