Research on Support Vector Machine-Based Classification Algorithms for Imbalanced Datasets
Abstract
The rapid development of modern computing has led to the accumulation of massive amounts of data in every area of scientific research and social life, and data mining emerged and developed rapidly as a way to turn these data into useful information and knowledge. There is, however, a class of datasets known as imbalanced datasets, in which the samples of one class far outnumber those of the other, and the information carried by the minority class is often the more important; the classification of imbalanced datasets has therefore become a hot topic in data mining research. The support vector machine (SVM) is a classification method built on statistical learning theory. It has a solid theoretical foundation and outperforms other classification algorithms on ordinary datasets, but its performance on imbalanced datasets is unsatisfactory.
Starting from the characteristics of imbalanced datasets, this thesis first analyzes why SVMs fail when classifying imbalanced data and proposes a cluster-based under-sampling method. The method under-samples the support vectors of the majority class, deleting part of the majority-class samples in order to reduce the degree of imbalance between the majority and minority classes; a different-error-cost SVM, one that assigns different penalties to errors on the two classes, is then trained on the new sample set to improve classification accuracy.
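The pipeline described above, clustering the majority class, keeping one representative per cluster, and then training an SVM with different per-class penalties, might be sketched roughly as follows. This is an illustrative reconstruction, not the thesis's exact algorithm: the toy data, the cluster count of 10, and the {0: 1, 1: 10} penalty weights are all assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def cluster_undersample(X_maj, n_clusters):
    """Cluster the majority class and keep one representative per cluster."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X_maj)
    reps = []
    for c in range(n_clusters):
        # pick the member closest to each cluster centre as the representative
        members = np.where(km.labels_ == c)[0]
        d = np.linalg.norm(X_maj[members] - km.cluster_centers_[c], axis=1)
        reps.append(members[np.argmin(d)])
    return X_maj[reps]

# toy imbalanced data: 100 majority (label 0) vs 10 minority (label 1) samples
rng = np.random.default_rng(0)
X_maj = rng.normal(0.0, 1.0, size=(100, 2))
X_min = rng.normal(3.0, 1.0, size=(10, 2))

X_maj_small = cluster_undersample(X_maj, n_clusters=10)
X = np.vstack([X_maj_small, X_min])
y = np.array([0] * len(X_maj_small) + [1] * len(X_min))

# different-error-cost SVM: minority-class errors are penalized more heavily
clf = SVC(kernel="rbf", class_weight={0: 1.0, 1: 10.0}).fit(X, y)
```

In a full implementation the under-sampling would be applied to the majority-class support vectors of a first-pass SVM rather than to the raw data; here it is applied directly to the toy majority set for brevity.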
Cost-sensitive learning is one of the popular approaches to imbalanced classification today, but the standard SVM is not itself cost-sensitive and so does not apply directly to cost-sensitive data mining. This thesis proposes a cost-sensitive SVM based on dataset decomposition: by means of posterior-probability outputs and a meta-learning process, a new sample set that integrates the misclassification costs is reconstructed, and a cost-sensitive SVM is trained on the reconstructed set so as to minimize the misclassification cost.
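A rough sketch of the posterior-probability plus meta-learning idea, in the spirit of Platt-scaled SVM outputs combined with MetaCost-style relabeling, is given below. It is a minimal illustration under assumed values: the toy data and the 5:1 cost matrix are invented for the example, and the thesis's dataset-decomposition step is not reproduced.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, size=(80, 2)),
               rng.normal(4.0, 1.0, size=(20, 2))])
y = np.array([0] * 80 + [1] * 20)

# cost[i, j]: cost of predicting class j when the true class is i;
# missing a minority (class-1) sample is assumed 5x as costly
cost = np.array([[0.0, 1.0],
                 [5.0, 0.0]])

# step 1: SVM with Platt-scaled posterior probability outputs
base = SVC(kernel="rbf", probability=True, random_state=0).fit(X, y)
proba = base.predict_proba(X)          # columns follow base.classes_

# step 2: MetaCost-style relabeling: assign each sample the class
# that minimizes its expected misclassification cost
expected_cost = proba @ cost           # shape (n_samples, n_classes)
y_relabel = base.classes_[np.argmin(expected_cost, axis=1)]

# step 3: retrain an SVM on the cost-integrated sample set
cost_sensitive = SVC(kernel="rbf").fit(X, y_relabel)
```

The relabeled set embeds the cost matrix into the class labels themselves, so the final SVM minimizes expected misclassification cost without its training procedure being modified.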
Simulation experiments were carried out for each algorithm under several evaluation criteria. The experimental results and their analysis show that the two algorithms achieve good results in improving classification accuracy and in minimizing misclassification cost, respectively.
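For imbalanced data, overall accuracy is a misleading evaluation criterion, which is why measures such as the G-mean (the geometric mean of minority-class and majority-class recall) and the total misclassification cost are used instead. A minimal sketch, with per-error costs that are assumed values for illustration:

```python
import numpy as np

def gmean_and_cost(y_true, y_pred, cost_fn=5.0, cost_fp=1.0):
    """G-mean and total misclassification cost for binary labels (1 = minority)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    sensitivity = tp / (tp + fn)       # recall on the minority class
    specificity = tn / (tn + fp)       # recall on the majority class
    gmean = np.sqrt(sensitivity * specificity)
    total_cost = fn * cost_fn + fp * cost_fp
    return gmean, total_cost

# one false negative and one false positive on a 2-vs-4 toy split
g, c = gmean_and_cost([1, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 1])
# g = sqrt(0.5 * 0.75) ~= 0.612, c = 1*5 + 1*1 = 6
```

A classifier that predicts only the majority class scores a G-mean of zero, so the measure directly exposes the failure mode that accuracy hides.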
