Weighted Oversampling Method Based on Hierarchical Clustering for Unbalanced Data (基于层次聚类的不平衡数据加权过采样方法)

English title: Weighted Oversampling Method Based on Hierarchical Clustering for Unbalanced Data
Authors: 夏英; 李刘杰; 张旭; 裴海英
Authors (English): XIA Ying; LI Liu-jie; ZHANG Xu; BAE Hae-young
Keywords: Imbalanced data; Hierarchical clustering; Oversampling; Overlapping sample
Journal name (Chinese): JSJA
Journal name (English): Computer Science
Institution: School of Computer Science and Technology, Chongqing University of Posts and Telecommunications
Publication date: 2019-04-15
Publisher: 计算机科学 (Computer Science)
Year: 2019
Issue: v.46
Funding: Supported by the National Natural Science Foundation of China (41571401)
Language: Chinese
Pages: JSJA201904004
Page count: 6
CN: 04
ISSN: 50-1075/TP
Classification number: 28-33

Abstract

Imbalanced data degrade the performance of traditional classification algorithms and lower the recognition rate of the minority class. Oversampling is one of the common methods for handling imbalanced datasets: it adds minority class samples so that the minority and majority classes become balanced to a certain extent. However, existing oversampling methods tend to synthesize overlapping samples and to cause overfitting. This paper proposes WOHC (Weighted Oversampling method based on Hierarchical Clustering), a weighted oversampling method for imbalanced data. The method first applies a hierarchical clustering algorithm to the minority class, dividing its samples into several clusters. It then computes a density factor for each cluster to determine that cluster's sampling rate. Finally, it determines the sampling weights according to the distance between the samples in each cluster and the boundary of the majority class. In classification experiments on several datasets, in which WOHC oversampling was combined with the C4.5 algorithm, the method improved the F-measure and G-mean of the classifier by 7.6% and 5.8% respectively, which indicates its effectiveness.
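
The three steps named in the abstract (cluster the minority class, set a per-cluster sampling rate from a density factor, weight seed samples by their distance to the majority class) can be illustrated with a rough sketch. This is not the authors' implementation: the abstract gives no formulas, so every concrete choice below is an assumption made for illustration. In particular, `agglomerative` and `wohc_sketch` are hypothetical names, average linkage is assumed, the "density factor" is approximated by the mean distance to the cluster centroid (sparser clusters are sampled more often), the "majority boundary" is approximated by the nearest majority sample (samples farther from it get larger weights), and new samples are produced by SMOTE-style interpolation inside a cluster.

```python
import math
import random


def agglomerative(points, n_clusters):
    """Naive average-linkage agglomerative clustering (assumed linkage).

    Returns a list of clusters, each a list of indices into `points`.
    """
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None  # (average distance, i, j) of the closest pair so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                total = sum(math.dist(points[a], points[b])
                            for a in clusters[i] for b in clusters[j])
                avg = total / (len(clusters[i]) * len(clusters[j]))
                if best is None or avg < best[0]:
                    best = (avg, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))  # j > i, so index i stays valid
    return clusters


def wohc_sketch(minority, majority, n_new, n_clusters=2, seed=0):
    """Generate n_new synthetic minority points (tuples of floats)."""
    rng = random.Random(seed)
    eps = 1e-12  # keeps every weight strictly positive

    # Step 1: cluster the minority class. Singleton clusters are skipped
    # because interpolation needs two distinct members.
    clusters = [c for c in agglomerative(minority, n_clusters) if len(c) >= 2]
    if not clusters:
        raise ValueError("no minority cluster has at least two samples")

    # Step 2 (assumed density factor): mean distance to the centroid.
    # A larger value means a sparser cluster, which is sampled more often.
    dim = len(minority[0])
    sparsity = []
    for c in clusters:
        centroid = tuple(sum(minority[i][k] for i in c) / len(c)
                         for k in range(dim))
        mean_d = sum(math.dist(minority[i], centroid) for i in c) / len(c)
        sparsity.append(mean_d + eps)

    # Step 3 (assumed weighting): distance to the nearest majority sample.
    weights = [[min(math.dist(minority[i], m) for m in majority) + eps
                for i in c] for c in clusters]

    synthetic = []
    for ci in rng.choices(range(len(clusters)), weights=sparsity, k=n_new):
        c = clusters[ci]
        a = rng.choices(c, weights=weights[ci], k=1)[0]  # weighted seed
        b = rng.choice([i for i in c if i != a])         # same-cluster mate
        t = rng.random()
        synthetic.append(tuple(x + t * (y - x)
                               for x, y in zip(minority[a], minority[b])))
    return synthetic


# Toy data (made up for this sketch): two minority groups, one majority blob.
minority = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5)]
majority = [(3, 3), (3, 4), (4, 3), (2.5, 3.5), (10, 10)]
for point in wohc_sketch(minority, majority, n_new=5):
    print(point)
```

Because each synthetic point is interpolated between two members of the same cluster, it stays inside that cluster's convex hull, which is one plausible way to limit overlap with the majority class. Reversing the weighting so that samples near the majority class are favored would only require replacing the distance in step 3 with its reciprocal.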
