Research on Feature Selection for Imbalanced Data
Abstract
A dataset is imbalanced when its classes differ greatly in size. The classification of imbalanced data has attracted wide attention in machine learning and data mining in recent years, yet feature selection for high-dimensional imbalanced data remains little studied. This thesis reviews existing approaches to the imbalance problem and introduces two new feature selection methods: one based on class decomposition and one based on the Hellinger distance.
     Imbalanced datasets are common in practice, and classifying them well is often important: the minority class usually carries a high misclassification cost, so omitting or mislabeling its samples has serious consequences. Chapter 1 introduces examples of the imbalance problem and reviews the solutions developed in machine learning and data mining, which fall roughly into five categories: sampling, training-set decomposition, cost-sensitive learning, classifier ensembles, and feature selection. Chapter 2 surveys the first four categories, and Chapter 3 surveys existing feature selection methods, including Case-Specific-IG, RELIEF, FAST, and a general feature selection framework. Chapter 4 presents the two proposed methods. The first builds on class decomposition: the large class is clustered into relatively small pseudo-subclasses, which are assigned corresponding pseudo-labels; this reduces the imbalance, after which a feature selection measure is applied to the relabeled data and classification is carried out on the selected features. The second ranks features by the Hellinger distance between the two class-conditional distributions of each feature; because the distance compares the distributions themselves, class imbalance does not distort it, making it a skew-insensitive measure of the relevance between a feature and the class label. On datasets from past KDD Cup competitions, both proposed methods achieve better classification results than existing feature selection methods.
References
Ali K, Pazzani M.1995. HYDRA-MM:learning multiple descriptions to improve classification accuracy[J]. International Journal on Artificial Intelligence Tools,4(01n02):115-133.
    Agrawal R, Imielinski T, Swami A.1993. Mining association rules between sets of items in large databases[C]//ACM SIGMOD Record. ACM,22(2):207-216.
    Bradley A P.1997. The use of the area under the ROC curve in the evaluation of machine learning algorithms[J]. Pattern recognition,30(7):1145-1159.
    Buckland M K, Gey F C.1994. The relationship between recall and precision[J]. JASIS,45(1): 12-19.
    Carvalho D R, Freitas A A.2002. A genetic-algorithm for discovering small-disjunct rules in data mining[J]. Applied Soft Computing,2(2):75-88.
    Carvalho D R, Freitas A A.2004.New results for a hybrid decision tree/genetic algorithm for data mining[M]//Applications and Science in Soft Computing. Springer Berlin Heidelberg: 149-154.
    Chan P K, Stolfo S J.1998. Toward Scalable Learning with Non-Uniform Class and Cost Distributions:A Case Study in Credit Card Fraud Detection[C]//KDD:164-168.
    Chang C and Lin C.2001. LIBSVM:a library for support vector machines[J/OL]. Available at http://www.csie.ntu.edu.tw/~cjlin/papers/libsvm.pdf.
    Chawla N V, Bowyer K W, Hall L O and Kegelmeyer W P.2002. SMOTE:Synthetic Minority Over-sampling Technique[J]. Journal of Artificial Intelligence Research,16:321-357.
    Chawla N V.2003.C4.5 and imbalanced data sets:investigating the effect of sampling method, probabilistic estimate, and decision tree structure[C]//Proceedings of the ICML.3.
    Chawla N V, Lazarevic A, Hall L O, et al.2003. SMOTEBoost:Improving prediction of the minority class in boosting[M]//Knowledge Discovery in Databases:PKDD 2003. Springer Berlin Heidelberg:107-119.
    Chen X, Wasikowski M.2008. FAST:A ROC-based Feature Selection Metric for Small Samples and Imbalanced Data Classification Problems[C]//KDD'08, August 24-27, Las Vegas, Nevada, USA.
    Chow T, Wang P and Ma E.2008. A new feature selection scheme using a data distribution factor for unsupervised nominal data[J]. IEEE Transactions on Systems, Man, and Cybernetics, Part B,38(2):499-509.
    Cieslak D A and Chawla N V.2008. Learning decision trees for unbalanced data[C]//Machine Learning and Knowledge Discovery in Databases, Lecture Notes in Computer Science,5211:241-256.
    Dhillon I, Mallela S and Kumar R.2003. A divisive information-theoretic feature clustering algorithm for text classification[J/OL]. JMLR 3:1265-1287. Available at http://jmlr.org/papers/v3/dhillon03a.html.
    Drummond C and Holte R.2003. C4.5, class imbalance, and cost sensitivity:why under-sampling beats over-sampling[R]. Workshop on Learning from Imbalanced Datasets II,11:1-8.
    Elkan C.2001. The foundations of cost-sensitive learning[C]//International joint conference on artificial intelligence. LAWRENCE ERLBAUM ASSOCIATES LTD,17(1):973-978.
    Estabrooks A, Japkowicz N.2001. A mixture-of-experts framework for learning from imbalanced data sets[M]//Advances in Intelligent Data Analysis. Springer Berlin Heidelberg:34-43.
    Fan W, Stolfo S J, Zhang J, et al.1999. AdaCost:misclassification cost-sensitive boosting[C]//ICML:97-105.
    Forman G.2003. An extensive empirical study of feature selection metrics for text classification [J]. The Journal of machine learning research,3:1289-1305.
    Freitas A A.2002. Evolutionary computation[C]//Handbook of Data Mining and Knowledge Discovery. Oxford University Press, Inc:698-706.
    Friedman J H, Kohavi R, Yun Y.1996. Lazy decision trees[C]//AAAI/IAAI, Vol.1:717-724.
    Goldberg D E.1989.Genetic algorithms in search, optimization, and machine learning[M]. Reading Menlo Park:Addison-wesley.
    Guyon I and Elissee A.2003. An introduction to variable and feature selection[J]. The Journal of Machine Learning Research.3:1157-1182.
    Holte R C, Acker L, Porter B W.1989. Concept Learning and the Problem of Small Disjuncts[C] //IJCAI.89:813-818.
    Japkowicz N, Myers C, Gluck M.1995. A novelty detection approach to classification[C]//IJCAI:518-523.
    Japkowicz N.2001. Concept-learning in the presence of between-class and within-class imbalances[M]//Advances in Artificial Intelligence. Springer Berlin Heidelberg:67-77.
    Japkowicz N.2002. Supervised learning with unsupervised output separation[C]//International Conference on Artificial Intelligence and Soft Computing.3:321-325.
    Japkowicz N, Stephen S.2002. The class imbalance problem:A systematic study[J]. Intelligent data analysis,6(5):429-449.
    Joshi M V, Agarwal R C, Kumar V.2001.Mining needle in a haystack:classifying rare classes via two-phase rule induction[C]//ACM SIGMOD Record. ACM,30(2):91-102.
    Joshi M.V., Agarwal R.C. and Kumar V.2002. Predicting rare classes:Can boosting make any weak learner strong? In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining:297-306.
    Joshi M V, Kumar V, Agarwal R C.2001. Evaluating boosting algorithms to classify rare classes: Comparison and improvements[C]//Data Mining,2001. ICDM 2001, Proceedings IEEE International Conference on. IEEE:257-264.
    Kailath T.1967. The divergence and Bhattacharyya distance measures in signal selection[J]. IEEE Transactions on Communication Technology,15(1):52-60.
    Kohavi R.1998.Data mining with MineSet:what worked, what did not, and what might[C]. In Workshop on Commercial Success of Data Mining, Fourth International Conference on Knowledge Discovery and Data Mining.
    Kubat M and Matwin S.1997. Addressing the curse of imbalanced training sets:one-sided selection. In Proceedings of the 14th Annual International Conference on Machine Learning.
    Kubat M, Holte R, Matwin S.1997. Learning when negative examples abound[M]//Machine Learning:ECML-97. Springer Berlin Heidelberg:146-153.
    Kubat M, Holte R C, Matwin S.1998. Machine learning for the detection of oil spills in satellite radar images[J]. Machine learning,30(2-3):195-215.
    Lewis D D, Catlett J.1994. Heterogenous Uncertainty Sampling for Supervised Learning[C]//ICML.94:148-156.
    Ling C X and Li C.1998. Data mining for direct marketing:Problems and solutions[J/OL]. Available at http://www.aaai.org/Papers/KDD/1998/KDD98-011.pdf.
    Liu B, Hsu W, Ma Y.1999. Mining association rules with multiple minimum supports[C] //Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM:337-341.
    Liu X, Wu J and Zhou Z.2009. Exploratory undersampling for class-imbalance learning[J]. IEEE Transactions on Systems, Man, and Cybernetics, Part B,39(2):539-550.
    Margineantu D and Dietterich T.1999. Learning decision trees for loss minimization in multi-class problems[R]. Oregon State University.
    Malina W.1981. On an extended Fisher criterion for feature selection[J]. PAMI,3(5):611-614.
    Mladenic D, Grobelnik M.1999. Feature selection for unbalanced class distribution and naive bayes[C]//ICML.99:258-267.
    Pang Y, Yuan Y and Li X.2008. Effective feature extraction in high-dimensional space[J]. IEEE Transactions on Systems, Man, and Cybernetics, Part B,38(6):1652-1656.
    Pazzani M J, Merz C J, Murphy P M, et al.1994. Reducing Misclassification Costs[C]//ICML.94: 217-225.
    Petricoin III E F, Ardekani A M, Hitt B A, et al.2002. Use of proteomic patterns in serum to identify ovarian cancer[J]. The lancet,359(9306):572-577.
    Petricoin E F, Ornstein D K, Paweletz C P, et al.2002. Serum proteomic patterns for detection of prostate cancer[J]. Journal of the National Cancer Institute,94(20):1576-1578.
    Pomeroy S L, Tamayo P, et al.2002. Prediction of central nervous system embryonal tumour outcome based on gene expression[J]. Nature,415:436-442.
    Provost F, Domingos P.2003. Tree induction for probability-based ranking[J]. Machine Learning, 52(3):199-215.
    Provost F, Fawcett T.2001. Robust classification for imprecise environments [J]. Machine Learning,42(3):203-231.
    Quinlan J R.1991. Improved estimates for the accuracy of small disjuncts[J]. Machine Learning, 6(1):93-98.
    Quinlan J R.1993. C4.5:programs for machine learning[M]. Morgan kaufmann.
    Radivojac P, Chawla N V, Dunker A K, et al.2004. Classification and knowledge discovery in protein databases[J]. Journal of Biomedical Informatics,37(4):224-239.
    Raskutti B, Kowalczyk A.2004. Extreme re-balancing for SVMs:a case study[J]. ACM Sigkdd Explorations Newsletter,6(1):60-69.
    Riddle P, Segal R, Etzioni O.1994. Representation design and brute-force induction in a Boeing manufacturing domain[J]. Applied Artificial Intelligence an International Journal,8(1): 125-147.
    Schapire R E.1999. A brief introduction to boosting[C]//IJCAI.99:1401-1406.
    Shipp M.A, et al.2002. Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning. Nature Medicine 8:68-74.
    Singhi S.K. and Liu H.2006. Feature subset selection bias for classification learning. In Proceeding ICML'06, In Proceedings of the 23rd international conference on Machine learning.849-856.
    Tan P.N. and Steinbach M.2005. Introduction to Data Mining[M]. Addison-Wesley.
    Tang Y, Zhang Y and Chawla N.V.2009. SVMs Modeling for Highly Imbalanced Classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B.39(1):281-288.
    Ting K M.1994.The problem of small disjuncts:Its remedy in decision trees [C]. In Proceedings of the Tenth Canadian Conference on Artificial Intelligence:91-97
    Torkkola K.2003. Feature extraction by non-parametric mutual information[J]. The Journal of Machine Learning Research.3:1415-1438.
    Van Rijsbergen C J.1979. Information Retrieval[M]. Butterworths, London
    Van Den Bosch A, Weijters A, Van Den Herik H J, et al.1997. When small disjuncts abound, try lazy learning:A case study[C]//Proceedings of the Seventh Belgian-Dutch Conference on Machine Learning:109-118.
    Wang M, Hua X, Tang J.H. and Hong R.2009. Beyond Distance Measurement:Constructing Neighborhood Similarity for Video Annotation[J]. IEEE Transactions on Multimedia,11(3): 465-476.
    Weiss G M.1995. Learning with rare cases and small disjuncts[C]//ICML:558-565.
    Weiss G M.1999. Timeweaver:A genetic algorithm for identifying predictive patterns in sequences of events[C]//Proceedings of the Genetic and Evolutionary Computation Conference.1:718-725.
    Weiss G.M.2004. Mining with rarity:a unifying framework[J]. ACM SIGKDD Explorations Newsletter,6(1):7-19.
    Weiss G M, Hirsh H.2000. A quantitative study of small disjuncts[C]//AAAI/IAAI:665-670.
    Weiss G M, Provost F J.2003. Learning when training data are costly:The effect of class distribution on tree induction[J]. J. Artif. Intell. Res.(JAIR),19:315-354.
    Cohen W W.1995. Fast effective rule induction[C]//Twelfth International Conference on Machine Learning:115-123.
    Witten I.H. and Frank E.2005. Data Mining:Practical machine learning tools and techniques[M]. Morgan Kaufmann.
    Xu Z, Jin R, Ye J, et al.2009. Non-monotonic feature selection[C]//Proceedings of the 26th Annual International Conference on Machine Learning. ACM:1145-1152.
    Yan R, Liu Y, Jin R, et al.2003. On predicting rare classes with SVM ensembles in scene classification[C]//Acoustics, Speech, and Signal Processing (ICASSP'03), 2003 IEEE International Conference on. IEEE,3:III-21-III-24.
    Yang Y, Pedersen J O.1997. A comparative study on feature selection in text categorization[C]//ICML.97:412-420.
    Yang Y, Shen H T, Ma Z, et al.2011. ℓ2,1-norm regularized discriminative feature selection for unsupervised learning[C]//Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence-Volume Two. AAAI Press:1589-1594.
    Yang Y, Wu F, Nie F, et al.2012. Web and personal image annotation by mining label correlation with relaxed visual graph embedding[J]. Image Processing, IEEE Transactions on,21(3): 1339-1351.
    Zadrozny B, Elkan C.2001.Learning and making decisions when costs and probabilities are both unknown[C]//Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining. ACM:204-213.
    Zheng Z, Wu X, Srihari R.2004. Feature selection for text categorization on imbalanced data[J]. ACM SIGKDD Explorations Newsletter,6(1):80-89.
