Kernel SVM Algorithm Based on Identifying Key Samples for Imbalanced Data (基于识别关键样本点的非平衡数据核SVM算法)
  • English title: Kernel SVM Algorithm Based on Identifying Key Samples for Imbalanced Data
  • Authors: 郭婷 (GUO Ting); 王杰 (WANG Jie); 刘全明 (LIU Quanming); 梁吉业 (LIANG Jiye)
  • Author affiliations: School of Computer and Information Technology, Shanxi University; Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, Shanxi University
  • Keywords: imbalanced data; kernel support vector machine; partition; under-sampling
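  • Journal code: MSSB
  • Journal: Pattern Recognition and Artificial Intelligence
  • Institutions: School of Computer and Information Technology, Shanxi University; Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, Shanxi University
  • Publication date: 2019-06-15
  • Publisher: 模式识别与人工智能 (Pattern Recognition and Artificial Intelligence)
  • Year: 2019
  • Issue: v.32; No.192
  • Funding: National Natural Science Foundation of China (No. 61876103); Key Project of the Key Research and Development Program of Shanxi Province (No. 201603D111014); Shanxi Province 1331 Project
  • Language: Chinese
  • Page: MSSB201906011
  • Page count: 8
  • CN: 06
  • ISSN: 34-1089/TP
  • Classification number: 91-98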
Abstract
Under-sampling methods commonly used for imbalanced data rarely take the characteristics of the support vector machine (SVM) into account, and sampling in the original space discards some key information about the majority class. To address these problems, a kernel SVM algorithm based on identifying key samples for imbalanced data (IK-KSVM) is proposed. First, the majority class is partitioned on the basis of an initial hyperplane. Then, kernel heterogeneous nearest neighbor sampling is performed on each partition in the high-dimensional space to obtain the key samples of the majority class. Finally, the final kernel SVM classifier is trained on the key samples and the minority class samples. Experiments on multiple datasets show that IK-KSVM is feasible and effective, and its advantage is most evident on datasets whose imbalance ratio exceeds 10:1.
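The sampling step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the partitions are formed or how many samples are kept, so the sketch assumes equal-size blocks ordered by the decision value of an initial classifier, a Gaussian (RBF) kernel, and a fixed number of kept samples per block. The names `rbf_kernel`, `kernel_distance`, and `select_key_samples` are hypothetical, and the decision values are assumed to be supplied by the caller.

```python
import math


def rbf_kernel(x, y, gamma=0.5):
    """Gaussian (RBF) kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def kernel_distance(x, y, kernel=rbf_kernel):
    """Distance between x and y in the kernel-induced feature space.

    Uses ||phi(x) - phi(y)||^2 = k(x, x) + k(y, y) - 2 k(x, y).
    """
    sq = kernel(x, x) + kernel(y, y) - 2.0 * kernel(x, y)
    return math.sqrt(max(sq, 0.0))  # clamp tiny negative rounding errors


def select_key_samples(majority, minority, scores, n_parts=2,
                       keep_per_part=1, kernel=rbf_kernel):
    """Return sorted indices of majority samples nearest the minority class.

    majority, minority: lists of feature vectors (tuples of floats).
    scores: one decision value per majority sample from an initial
        classifier (assumption: provided by the caller).
    n_parts: number of blocks (assumption: equal-size blocks ordered
        by decision value).
    keep_per_part: samples kept from each block (assumption).
    """
    if not majority or not minority or n_parts < 1:
        return []

    # Order majority indices by decision value, then cut into blocks.
    order = sorted(range(len(majority)), key=lambda i: scores[i])
    block_size = math.ceil(len(order) / n_parts)
    blocks = [order[s:s + block_size]
              for s in range(0, len(order), block_size)]

    selected = []
    for block in blocks:
        # Kernel distance from each majority sample to its nearest
        # sample of the other (minority) class.
        nearest = {
            i: min(kernel_distance(majority[i], m, kernel) for m in minority)
            for i in block
        }
        ranked = sorted(block, key=lambda i: nearest[i])
        selected.extend(ranked[:keep_per_part])
    return sorted(selected)
```

The returned indices would then be combined with all minority samples to train the final kernel SVM; that training step is left out here because it depends on the SVM library in use.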
