Research on Classification of Improved Smote Algorithm on Imbalanced Datasets
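  • English title: Research on Classification of Improved Smote Algorithm on Imbalanced Datasets
  • Authors: YI Wei (易未); MAO Li (毛力); SUN Jun (孙俊); WU Lin-hai (吴林海)
  • Author affiliations (English): School of Internet of Things, Jiangnan University; School of Business, Jiangnan University; Food Safety Risk Management Institute, Jiangnan University
  • Keywords: imbalanced dataset; Smote algorithm; R-Smote algorithm; SD-ISmote algorithm; ImprovedSmote algorithm; cluster center
  • English keywords: imbalanced dataset; Smote; R-Smote; SD-ISmote; ImprovedSmote; cluster center
  • Journal title (Chinese): JYXH
  • Journal title (English): Computer and Modernization
  • Institutions: School of Internet of Things, Jiangnan University; School of Business, Jiangnan University; Food Safety Risk Management Institute, Jiangnan University
  • Publication date: 2018-03-15
  • Publisher: Computer and Modernization (计算机与现代化)
  • Year: 2018
  • Issue: No.271
  • Funding: Special Fund for Grain Scientific Research in the Public Interest (201513004-6); Sub-project of the National Science and Technology Program for Rural Areas under the 12th Five-Year Plan (2015BAD17B02-8); Earmarked Fund for the Modern Agro-industry Technology Research System (CARS-49); Jiangsu Province Industry-University-Research Cooperation Project (BY2015019-30)
  • Language: Chinese
  • Page: JYXH201803017
  • Page count: 6
  • CN: 03
  • ISSN: 36-1137/TP
  • Classification number: 87-92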
Abstract
On imbalanced datasets, oversampling algorithms such as Smote (Synthetic Minority Over-sampling Technique), R-Smote and SD-ISmote may blur the boundary between the majority and minority classes and may use noisy data to synthesize new samples. The ImprovedSmote algorithm proposed in this paper uses the cluster centers of the minority set together with the minority samples of the corresponding class: new data are generated by random interpolation inside the figure formed by a cluster center and a number of same-class minority samples no greater than the number of sample attributes. Smote, R-Smote, SD-ISmote and ImprovedSmote are each combined with the C4.5 decision tree and a neural network and evaluated on the experimental datasets. The results show that ImprovedSmote outperforms Smote, R-Smote and SD-ISmote and can effectively improve classifier performance.
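The interpolation step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The abstract does not say how the clusters are found or how the random point is drawn, so the sketch assumes the minority samples are already grouped into clusters, takes the cluster mean as the cluster center, and draws random convex weights. The names `cluster_center` and `synthesize` are hypothetical.

```python
import random


def cluster_center(points):
    """Mean of equal-length points (assumed stand-in for the cluster center)."""
    dim = len(points[0])
    return [sum(p[j] for p in points) / len(points) for j in range(dim)]


def synthesize(cluster, n_new, rng=None):
    """Generate n_new synthetic points, each a random convex combination of the
    cluster center and at most `dim` minority samples from the same cluster."""
    rng = rng or random.Random()
    dim = len(cluster[0])
    center = cluster_center(cluster)
    new_points = []
    for _ in range(n_new):
        # Pick no more samples than there are attributes (and than the cluster holds).
        k = rng.randint(1, min(dim, len(cluster)))
        vertices = [center] + rng.sample(cluster, k)
        # Normalised exponential draws give non-negative weights that sum to 1.
        raw = [rng.expovariate(1.0) for _ in vertices]
        total = sum(raw)
        weights = [r / total for r in raw]
        new_points.append(
            [sum(w * v[j] for w, v in zip(weights, vertices)) for j in range(dim)]
        )
    return new_points


# Toy minority cluster with two attributes (made-up data for illustration only).
minority_cluster = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
synthetic = synthesize(minority_cluster, 5, random.Random(0))
```

Because every synthetic point is a convex combination of the center and same-cluster minority samples, it stays inside the region those points span. Under the assumptions above, this is one way a scheme of this kind could avoid placing new samples across the class boundary.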
References
[1] He Haibo, Garcia E A. Learning from imbalanced data [J]. IEEE Transactions on Knowledge and Data Engineering, 2009, 21(9): 1263-1284.
[2] He Haibo, Shen Xiaoping. A ranked subspace learning method for gene expression data classification [C]// Proceedings of 2007 International Conference on Artificial Intelligence. 2007: 358-364.
[3] Kubat M, Holte R C, Matwin S. Machine learning for the detection of oil spills in satellite radar images [J]. Machine Learning, 1998, 30(2-3): 195-215.
[4] Pearson R, Goney G, Shwaber J. Imbalanced clustering for microarray time-series [C]// Workshop for Learning from Imbalanced Datasets II, 2003.
[5] Zhao Hong, Li Xiangju. A cost sensitive decision tree algorithm based on weighted class distribution with batch deleting attribute mechanism [J]. Information Sciences, 2017, 378(C): 303-316.
[6] Reddy M V, Sodhi R. A rule-based S-transform and AdaBoost based approach for power quality assessment [J]. Electric Power Systems Research, 2016, 134: 66-79.
[7] Shi Ying, Qi Hui. An imbalanced data classification method based on redundancy-removing sampling [J]. Journal of Shanxi University (Natural Science Edition), 2017, 40(2): 255-261. (in Chinese)
[8] Yen S J, Lee Y S. Cluster-based under-sampling approaches for imbalanced data distributions [J]. Expert Systems with Applications, 2009, 36(3): 5718-5727.
[9] Chawla N V, Bowyer K W, Hall L O, et al. SMOTE: Synthetic minority over-sampling technique [J]. Journal of Artificial Intelligence Research, 2002, 16(1): 321-357.
[10] Dong Yanjie. Research on the Random-SMOTE method for classification of imbalanced datasets [D]. Dalian: Dalian University of Technology, 2009. (in Chinese)
[11] Wang Chaoxue, Pan Zhengmao, Dong Lili, et al. Research on classification of imbalanced datasets based on improved SMOTE [J]. Computer Engineering and Applications, 2013, 49(2): 184-187. (in Chinese)
[12] Han Hui, Wang Wenyuan, Mao Binghuan. Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning [C]// Proceedings of 2005 International Conference on Advances in Intelligent Computing. 2005, 3644(5): 878-887.
[13] Santos M S, Abreu P H, García-Laencina P J, et al. A new cluster-based oversampling method for improving survival prediction of hepatocellular carcinoma patients [J]. Journal of Biomedical Informatics, 2015, 58: 49-59.
[14] Yuan Ming. Research on imbalanced data classification based on the R-SMOTE method [D]. Baoding: Hebei University, 2015. (in Chinese)
[15] Gu Ping, Yang Yang. An oversampling algorithm oriented to subdivision of the minority class in imbalanced datasets [J]. Computer Engineering, 2017, 43(2): 241-247. (in Chinese)
[16] Rodriguez A, Laio A. Clustering by fast search and find of density peaks [J]. Science, 2014, 344(6191): 1492-1496.
[17] Yang Ming, Yin Junmei, Ji Genlin. A survey of classification methods for imbalanced datasets [C]// Proceedings of the 3rd Jiangsu Computer Conference. 2008. (in Chinese)
