An Oversampling Method for Imbalanced Data Based on Three-Way Decisions (基于三支决策的不平衡数据过采样方法)
  • English title: An Oversampling Method for Imbalance Data Based on Three-Way Decision Model
  • Authors: HU Feng (胡峰); WANG Lei (王蕾); ZHOU Yao (周耀)
  • Affiliation: Chongqing Key Laboratory of Computational Intelligence (Chongqing University of Posts and Telecommunications)
  • Keywords: three-way decision; neighborhood rough set; boundary sampling; imbalanced data; SMOTE
  • Journal code: DZXU
  • Journal: Acta Electronica Sinica (电子学报)
  • Publication date: 2018-01-15
  • Publisher: Acta Electronica Sinica
  • Year: 2018
  • Issue: v.46; No.419
  • Funding: National Natural Science Foundation of China (No.61309014, No.61379114, No.61472056); Humanities and Social Sciences Planning Project of the Ministry of Education (No.15XJA630003); Chongqing Basic and Frontier Research Program (No.cstc2013jcyjA40063, No.cstc2014jcyjA40049); Science and Technology Research Project of the Chongqing Municipal Education Commission (No.KJ1500416)
  • Language: Chinese
  • Page: DZXU201801019
  • Page count: 10
  • CN: 01
  • ISSN: 11-2087/TN
  • Classification number: 138-147
Abstract
Sampling is an effective way to address the classification of imbalanced data. Drawing on three-way decision theory, this paper partitions the samples into three regions according to their distribution: the positive region, the boundary region, and the negative region. Minority-class samples in the boundary region and in the negative region are then oversampled with different strategies, yielding an oversampling algorithm for imbalanced data based on three-way decisions (the TWD-IDOS algorithm). Experiments with the C4.5, KNN, and CART classifiers show that the proposed algorithm handles two-class classification of imbalanced data effectively and outperforms the oversampling algorithms from the literature on Recall, F-value, and AUC.
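The abstract gives only the outline of the method, so the following is a minimal sketch of that outline and not the published TWD-IDOS algorithm. It assumes a neighbour-share rule for assigning minority samples to regions, hypothetical thresholds `alpha` and `beta`, and SMOTE-style linear interpolation with different sampling rates for the boundary and negative regions. All function names and parameters are illustrative.

```python
import math
import random


def nearest(idx, points, candidates, k):
    """Return up to k indices from candidates closest to points[idx] (Euclidean)."""
    pool = [j for j in candidates if j != idx]
    pool.sort(key=lambda j: math.dist(points[idx], points[j]))
    return pool[:k]


def partition_minority(points, labels, minority=1, k=5, alpha=0.7, beta=0.3):
    """Split minority samples into positive / boundary / negative regions.

    Assumed rule (not from the paper): the share of minority samples among
    the k nearest neighbours decides the region.
    """
    regions = {"positive": [], "boundary": [], "negative": []}
    everyone = range(len(points))
    for i, y in enumerate(labels):
        if y != minority:
            continue
        nbrs = nearest(i, points, everyone, k)
        if not nbrs:
            continue
        share = sum(labels[j] == minority for j in nbrs) / len(nbrs)
        if share >= alpha:
            regions["positive"].append(i)
        elif share <= beta:
            regions["negative"].append(i)
        else:
            regions["boundary"].append(i)
    return regions


def oversample(points, labels, minority=1, k=5,
               per_boundary=2, per_negative=1, seed=0):
    """Create synthetic minority points for boundary and negative regions.

    Each region gets its own sampling rate (an assumption); positive-region
    samples are left untouched.
    """
    rng = random.Random(seed)
    regions = partition_minority(points, labels, minority, k)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    synthetic = []
    for name, count in (("boundary", per_boundary), ("negative", per_negative)):
        for i in regions[name]:
            mates = nearest(i, points, minority_idx, k)
            if not mates:
                continue
            for _ in range(count):
                j = rng.choice(mates)
                t = rng.random()
                # Linear interpolation between the sample and a minority neighbour.
                synthetic.append(
                    [a + t * (b - a) for a, b in zip(points[i], points[j])]
                )
    return synthetic
```

Calling `oversample(points, labels)` returns only the new synthetic points, so the caller decides how to merge them with the original data before training a classifier such as C4.5, KNN, or CART.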
References
[1] DRUMMOND C, et al. C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling[A]. Proceedings of ICML Workshop on Learning from Imbalanced Datasets II[C]. New York: ACM, 2003. 1-8.
[2] CHAWLA N V, BOWYER K W, et al. SMOTE: synthetic minority over-sampling technique[J]. Journal of Artificial Intelligence Research, 2002, (16): 321-357.
[3] YEN S J, et al. Cluster-based under-sampling approaches for imbalanced data distributions[J]. Expert Systems with Applications, 2009, 36(3): 5718-5727.
[4] HAN H, WANG W Y, MAO B H. Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning[A]. Proceedings of International Conference on Intelligent Computing (ICIC)[C]. Germany: Springer, 2005. 878-887.
[5] YANG Zhiming, QIAO Liyan, PENG Xiyuan. Research on data mining method for imbalanced dataset based on improved SMOTE[J]. Acta Electronica Sinica, 2007, 35(12): 22-26. (in Chinese)
[6] RAMENTOL E, CABALLERO YAILé, BELLO R, et al. SMOTE-RSB*: a hybrid preprocessing approach based on oversampling and undersampling for high imbalanced datasets using SMOTE and rough sets theory[J]. Knowledge and Information Systems, 2012, 33(2): 245-265.
[7] ZENG Zhiqiang, WU Qun, LIAO Beishui, et al. A classification method for imbalance data set based on Kernel SMOTE[J]. Acta Electronica Sinica, 2009, 37(11): 2489-2495. (in Chinese)
[8] ZHAI Yun, WANG Shupeng, MA Nan, et al. A data mining method for imbalanced datasets based on one-sided link and distribution density of instance[J]. Acta Electronica Sinica, 2014, 42(7): 1311-1319. (in Chinese)
[9] WANG Lei, HUANG Hexiao, WU Bing, et al. Emotion analysis of text based on topics and three-way decisions[J]. Computer Science, 2015, 42(6): 93-96. (in Chinese)
[10] LI H X, et al. Sequential three-way decision and granulation for cost-sensitive face recognition[J]. Knowledge-Based Systems, 2016, 91(1): 241-251.
[11] LIU D, LI T R, et al. Incorporating logistic regression to decision-theoretic rough sets for classifications[J]. International Journal of Approximate Reasoning, 2014, 55(1): 197-210.
[12] YU H, ZHANG C, WANG G Y. A tree-based incremental overlapping clustering method using the three-way decision theory[J]. Knowledge-Based Systems, 2016, 91(1): 189-203.
[13] LIU S L, LIU X W. A novel three-way decision based on linguistic evaluation[A]. Proceedings of 2015 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE)[C]. Istanbul: IEEE, 2015. 1-7.
[14] LIU D, LIANG D C, et al. A novel three-way decision model based on incomplete information system[J]. Knowledge-Based Systems, 2016, 91(1): 32-45.
[15] CHEN Y M, ZENG Z Q, et al. Three-way decision reduction in neighborhood systems[J]. Applied Soft Computing, 2016, 38(1): 942-954.
[16] LIU D, LI T R, et al. A multiple-category classification approach with decision-theoretic rough sets[J]. Fundamenta Informaticae, 2012, 115(2-3): 173-188.
[17] ZHOU B. Multi-class decision-theoretic rough sets[J]. International Journal of Approximate Reasoning, 2014, 55(1): 211-224.
[18] LIN T Y. Neighborhood systems and approximation in relational databases and knowledge bases[A]. Proceedings of the Fourth International Symposium on Methodologies of Intelligent Systems[C]. Charlotte NC: Oak Ridge National Laboratory, 1989. 75-86.
[19] HU Q H, et al. Neighborhood classifiers[J]. Expert Systems with Applications, 2008, 34(2): 866-876.
[20] STANFILL C, WALTZ D. Toward memory-based reasoning[J]. Communications of the ACM, 1986, 29(12): 1213-1228.
[21] YAO Y Y. An outline of a theory of three-way decisions[A]. Proceedings of Eighth International Conference of RSCTC[C]. Germany: Springer, 2012. 1-17.
[22] LIU Dun, LI Tianrui, MIAO Duoqian, et al. Three-Way Decision and Granular Computing[M]. Beijing: Science Press, 2013. 12-30. (in Chinese)
[23] HU F, LI H. A novel boundary oversampling algorithm based on neighborhood rough set model: NRSBoundary-SMOTE[J/OL]. Mathematical Problems in Engineering, 2013, Article ID 694809, doi:10.1155/2013/694809.
[24] LAURIKKALA J. Improving identification of difficult small classes by balancing class distribution[A]. Proceedings of Eighth Conference on Artificial Intelligence in Medicine in Europe (AIME)[C]. Germany: Springer, 2001. 63-66.
[25] TOMEK I. An experiment with the edited nearest-neighbor rule[J]. IEEE Transactions on Systems, Man, and Cybernetics, 1976, 6(6): 448-452.
[26] LIU X Y, WU J, ZHOU Z H. Exploratory under-sampling for class-imbalance learning[J]. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 2009, 39(2): 539-550.
[27] Wikipedia. Weka (machine learning)[CP/OL]. http://en.wikipedia.org/wiki/Weka, 2010.
[28] Learning and Mining from Data (LAMDA)[CP/OL]. http://lamda.nju.edu.cn/CH.Data.ashx, 2016-10-31.
