An Improved Under-sampling Algorithm for Imbalanced Data
  • English Title: Improved Under-sampling Algorithm for Imbalanced Data
  • Authors: WEI Li; ZHANG Yu-ping
  • Affiliation: School of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics
  • Keywords: imbalanced data; under-sampling; clustering
  • Journal: Journal of Chinese Computer Systems (小型微型计算机系统)
  • Journal code: XXWX
  • Publication date: 2019-05-14
  • Year: 2019
  • Volume: v.40
  • Issue: 05
  • Pages: 184-188
  • Page count: 5
  • Article ID: XXWX201905036
  • CN: 21-1106/TP
  • Language: Chinese
Abstract

Imbalanced data sets arise in many application domains, and training a classifier directly on such data interferes with the algorithm's learning process. Traditional under-sampling is a controversial remedy because it discards much of the information carried by the majority class. To overcome this deficiency, the paper combines the strengths of the NearMiss algorithm and K-Means clustering in handling imbalanced data and proposes CBNM (Clustering-Based NearMiss), which computes the NearMiss distance of each cluster center and uses it to assign that center a selection weight. Experiments on ten UCI data sets compare the three NearMiss variants within CBNM and benchmark the result against the NearMiss-2 algorithm. The results show that CBNM yields significant gains in F-Measure and G-Mean and clearly improves classification performance.
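The abstract only outlines the idea, so the following is a minimal sketch of a cluster-based NearMiss-style under-sampler, assuming scikit-learn and NumPy. The function name `cbnm_undersample`, the inverse-distance weighting, and the per-cluster quota scheme are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np
from sklearn.cluster import KMeans

def cbnm_undersample(X_maj, X_min, n_clusters=5, k=3, random_state=0):
    """Cluster-based NearMiss-style under-sampling (illustrative sketch).

    1. Partition the majority class into clusters with K-Means.
    2. Score each cluster center by a NearMiss-1-style distance: the mean
       distance from the center to its k nearest minority samples.
    3. Draw majority samples from each cluster in proportion to a weight
       favoring centers close to the minority class, until roughly as many
       majority samples remain as there are minority samples.
    """
    rng = np.random.default_rng(random_state)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state).fit(X_maj)
    labels, centers = km.labels_, km.cluster_centers_

    # Pairwise distances from each center to every minority sample:
    # shape (n_clusters, n_minority).
    d = np.linalg.norm(X_min[None, :, :] - centers[:, None, :], axis=2)
    # NearMiss-1-style score: mean distance to the k closest minority points.
    scores = np.sort(d, axis=1)[:, :k].mean(axis=1)

    # Smaller distance -> larger selection weight (assumed inverse weighting).
    weights = 1.0 / (scores + 1e-12)
    weights /= weights.sum()

    n_keep = len(X_min)  # target: balanced classes
    quotas = np.floor(weights * n_keep).astype(int)
    quotas[np.argmax(weights)] += n_keep - quotas.sum()  # rounding remainder

    picked = []
    for c in range(n_clusters):
        idx = np.flatnonzero(labels == c)
        take = min(quotas[c], idx.size)  # a cluster may hold fewer samples
        picked.extend(rng.choice(idx, size=take, replace=False))
    return X_maj[np.array(picked, dtype=int)]
```

Swapping the score in step 2 for the mean distance to the k *farthest* minority samples, or the mean over all minority samples, would give NearMiss-2- and NearMiss-3-flavored variants of the same skeleton, mirroring the three variants the experiments compare.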
References
[1] Soda P. A multi-objective optimisation approach for class imbalance learning[J]. Pattern Recognition, 2011, 44(8): 1801-1810.
    [2] He H, Garcia E A. Learning from imbalanced data[J]. IEEE Transactions on Knowledge & Data Engineering, 2009, 21(9): 1263-1284.
    [3] Wan C H, Lee L H, Rajkumar R, et al. A hybrid text classification approach with low dependency on parameter by integrating K-nearest neighbor and support vector machine[J]. Expert Systems with Applications, 2012, 39(15): 11880-11888.
    [4] Weiss G M. Mining with rarity[J]. ACM SIGKDD Explorations Newsletter, 2004, 6(1): 7-19.
    [5] Provost F, Weiss G M. Learning when training data are costly: the effect of class distribution on tree induction[J]. Journal of Artificial Intelligence Research, 2003, 19(1): 315-354.
    [6] Raskutti B, Kowalczyk A. Extreme rebalancing for SVMs: a case study[J]. SIGKDD Explorations, 2004, 6(1): 60-69.
    [7] Zheng Z, Srihari R K. Optimally combining positive and negative features for text categorization[C]. Proceedings of the ICML'03 Workshop on Learning from Imbalanced Data Sets, Washington DC: AAAI Press, 2003.
    [8] Fan W, Stolfo S J, Zhang J, et al. AdaCost: misclassification cost-sensitive boosting[C]. International Conference on Machine Learning, 1999: 97-105.
    [9] Wu G, Chang E Y. Class-boundary alignment for imbalanced data set learning[C]. ICML Workshop on Learning from Imbalanced Data Sets, 2003: 49-56.
    [10] Ertekin S, Huang J, Bottou L, et al. Learning on the border: active learning in imbalanced data classification[C]. Conference on Information and Knowledge Management, 2007: 127-136.
    [11] Ha J, Lee J S. A new under-sampling method using genetic algorithm for imbalanced data classification[C]. International Conference on Ubiquitous Information Management & Communication, ACM, 2016: 95.
    [12] Chawla N V, Bowyer K W, Hall L O, et al. SMOTE: synthetic minority over-sampling technique[J]. Journal of Artificial Intelligence Research, 2002, 16: 321-357.
    [13] Han H, Wang W Y, Mao B H. Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning[C]. Proc. Int'l Conf. Intelligent Computing, 2005: 878-887.
    [14] Zhang J, Mani I. kNN approach to unbalanced data distributions: a case study involving information extraction[C]. Proceedings of the ICML 2003 Workshop on Learning from Imbalanced Datasets, 2003.
    [15] Xiong Bing-yan, Wang Guo-yin, Deng Wei-bin. Under-sampling method based on sample weight for imbalanced data[J]. Journal of Computer Research and Development, 2016, 53(11): 2613-2622.
    [16] Wang Chao-xue, Zhang Tao, Ma Chun-sen. Improved SMOTE algorithm for imbalanced datasets[J]. Journal of Frontiers of Computer Science and Technology, 2014, 8(6): 727-734.
    [17] Yang Jie-ming, Yan Xin, Qu Zhao-yang, et al. Under-sampling technique based on data density distribution[J]. Application Research of Computers, 2016, 33(10): 2997-3000.
    [18] Wu Chang-an, Zheng Gui-rong, Sun Yan-ge, et al. Imbalanced learning based on K-means and logistic regression mixed strategy[J]. Journal of Chinese Computer Systems, 2017, 38(9): 2119-2124.
    [19] Liu Jing, Gu Li-ze, Niu Xin-xin, et al. Research on network anomaly detection based on one-class SVM and active learning[J]. Journal on Communications, 2015, 36(11): 136-146.
