Research on Kernel-Method Classification of Imbalanced Data
Abstract
Imbalanced data classification problems are common in the real world, for example in medical diagnosis, radar image monitoring, and fraud detection. Because such data are inherently uneven, with the numbers of positive and negative samples differing greatly, traditional classification algorithms become less effective on them. How to classify imbalanced data effectively and accurately has therefore become one of the active research topics in machine learning and pattern recognition.
     Building on traditional kernel methods, this work combines a new over-sampling method with a Support Vector Machine learning algorithm that uses different penalty factors, with the aim of improving classification performance on imbalanced data. The main contributions are:
     (1) To address the imbalance problem, this work proposes SMOIS (Synthetic Minority Over-sampling In Image Space), a method that processes the data in the image space of the kernel method. Unlike strategies that generate new synthetic minority-class samples in the original data space, SMOIS introduces non-repeated synthetic minority-class samples in the image space obtained after the mapping, in order to reduce the classification algorithm's sensitivity to minority-class samples. Experimental results show that the method achieves better classification performance as measured by ROC curves and the g-means metric.
     (2) The Support Vector Machine (SVM) is an effective classification learning algorithm, but its results on imbalanced data are often unsatisfactory. This work therefore combines the SMOIS method with a modified SVM algorithm and proposes an SMOIS-based SVM learning algorithm for classifying imbalanced data effectively.
     The topic studied here is a current research focus, and the results have both theoretical significance and direct practical value for real-world problems.
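The abstract does not spell out how SMOIS constructs its synthetic samples, so the following is only a minimal sketch of one way over-sampling in the image space could be realized, assuming SMOTE-style linear interpolation between the images of two minority samples. An interpolated image-space point is a linear combination of mapped training samples, so its inner products with all other points follow from the original kernel matrix K, and the enlarged kernel matrix can be written as A K A^T. The RBF kernel, the toy data, and all function names are illustrative assumptions, not taken from the work itself. The helper `g_mean` computes the usual geometric mean of the true-positive and true-negative rates.

```python
import math
import random


def rbf_kernel(x, z, gamma=0.5):
    """Gaussian (RBF) kernel between two equal-length vectors (assumed kernel)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


def gram_matrix(points, kernel):
    """Kernel (Gram) matrix K[i][j] = k(x_i, x_j)."""
    return [[kernel(p, q) for q in points] for p in points]


def synthetic_coefficients(n_points, minority_idx, n_new, rng):
    """Each synthetic sample lies on the segment between the images of two
    distinct minority samples: (1 - d) * phi(x_i) + d * phi(x_j).
    It is stored as a coefficient vector over the original samples."""
    rows = []
    for _ in range(n_new):
        i, j = rng.sample(minority_idx, 2)  # two different minority samples
        d = rng.random()                    # interpolation weight in [0, 1)
        row = [0.0] * n_points
        row[i] = 1.0 - d
        row[j] = d
        rows.append(row)
    return rows


def augmented_gram(K, synth_rows):
    """Gram matrix over original + synthetic points, computed as A K A^T,
    where A stacks the identity (original points) on the synthetic rows."""
    n = len(K)
    identity = [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]
    A = identity + synth_rows
    AK = [[sum(a[t] * K[t][c] for t in range(n)) for c in range(n)] for a in A]
    return [[sum(ak[t] * b[t] for t in range(n)) for b in A] for ak in AK]


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of true-positive rate and true-negative rate.
    Assumes both classes occur in y_true."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    n_pos = sum(1 for t in y_true if t == positive)
    n_neg = len(y_true) - n_pos
    return math.sqrt((tp / n_pos) * (tn / n_neg))


# Toy data: 6 majority samples (label -1) and 2 minority samples (label +1).
X = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (0.4, 0.2), (0.3, 0.4), (0.5, 0.5),
     (2.0, 2.0), (2.2, 1.9)]
y = [-1, -1, -1, -1, -1, -1, 1, 1]

rng = random.Random(0)
K = gram_matrix(X, rbf_kernel)
minority = [i for i, label in enumerate(y) if label == 1]
synth = synthetic_coefficients(len(X), minority, n_new=4, rng=rng)
K_aug = augmented_gram(K, synth)   # 12 x 12 kernel matrix
y_aug = y + [1] * len(synth)       # classes are now balanced 6 : 6
```

The enlarged matrix `K_aug` and the labels `y_aug` could then be passed to any SVM solver that accepts a precomputed kernel, with separate penalty factors for the two classes if desired.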
