Research on Feature Selection Methods and Their Application to the Diagnosis of Erythemato-Squamous Diseases
Abstract
Feature selection, as an important data preprocessing method, is an essential component of supervised learning and occupies an important position in research and applications in data mining, machine learning, pattern recognition, and related fields. In recent years, the continual emergence of large-scale problems such as image processing, text recognition, and gene expression analysis has drawn increasing attention to feature selection algorithms while also posing serious challenges to them, making it an urgent need to find feature selection methods whose accuracy and running efficiency scale well to large datasets. This thesis studies feature selection algorithms for high-dimensional data, proposes a feature-importance measure suited to feature selection in multi-class pattern recognition problems, and applies the proposed feature selection algorithms to the diagnosis of erythemato-squamous diseases. The main work of this thesis consists of the following parts.
     First, the current state of and open problems in feature selection research are examined in depth: the definition of feature selection, the relationship between feature selection and feature extraction, the four aspects of feature selection, and its two models (Filter and Wrapper) are analyzed; several common search algorithms are summarized; and practical guidance on choosing among feature selection algorithms is given.
     Second, an improved F-score feature selection method is proposed. The traditional F-score measures the ability of a feature to discriminate between two classes; this thesis generalizes it so that the improved F-score can evaluate a feature's discriminative power not only between two classes but also among multiple classes. In addition, drawing on the respective strengths and weaknesses of the Filter and Wrapper models, a feature selection method based on IFSFS (Improved F-score and Sequential Forward Search) and SVM (Support Vector Machines) is proposed: the improved F-score serves as the feature selection criterion, sequential forward search (SFS) as the search strategy, and an SVM classifier as the means of evaluating the effectiveness of candidate feature subsets. The method is applied to the diagnosis of erythemato-squamous diseases, and the experimental results demonstrate its effectiveness.
     Finally, to address the main drawback of SFS, namely that once a feature has been selected it can never be removed even if features added later make it redundant, this thesis proposes a feature selection method that combines IFSFFS (Improved F-score and Sequential Forward Floating Search) with SVM. Experiments applying the IFSFFS+SVM method to the diagnosis of erythemato-squamous diseases show that it achieves very good diagnostic results.
In the fields of data mining, machine learning, and pattern recognition, feature selection, as an important data preprocessing step, is an essential part of supervised learning. In recent years, with the emergence of large-scale datasets, especially in image processing, text recognition, and gene expression analysis, feature selection has become a very active research area and faces greater challenges. It is now necessary to develop feature selection algorithms that combine high accuracy and efficiency for reducing high-dimensional datasets. This thesis focuses on feature selection for high-dimensional data and proposes new feature selection algorithms, which are applied to the diagnosis of erythemato-squamous diseases. The contributions of this thesis mainly include the following parts.
     Firstly, this thesis makes a specific and in-depth analysis of the current open problems in feature selection. It explains the definition of feature selection, describes the difference between feature selection and feature extraction, and introduces the four aspects of feature selection as well as the Filter and Wrapper feature selection models. It then reviews several conventional search strategies for feature selection and offers guidance on how to choose among them.
     Secondly, an improved F-score feature selection method is proposed. The original F-score is a simple criterion that measures the discrimination between two sets of real numbers; the improved F-score proposed here can measure the discrimination among more than two sets of real numbers.
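     To make the criterion concrete, the sketch below (Python with NumPy; the function name multiclass_f_score is only illustrative) computes one plausible multi-class generalization of the F-score: for each feature, the squared deviations of the per-class means from the overall mean are summed and divided by the sum of the within-class sample variances. The exact formula and normalization used in the thesis may differ.

import numpy as np

def multiclass_f_score(X, y):
    """Per-feature F-score generalized to more than two classes.

    For each feature, the score is the sum over classes of
    (class mean - overall mean)^2 divided by the sum over classes
    of the within-class sample variance of that feature.
    X: (n_samples, n_features) array; y: (n_samples,) class labels.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    overall_mean = X.mean(axis=0)
    numerator = np.zeros(X.shape[1])
    denominator = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        numerator += (Xc.mean(axis=0) - overall_mean) ** 2
        denominator += Xc.var(axis=0, ddof=1)   # unbiased within-class variance
    return numerator / (denominator + 1e-12)    # guard against zero variance

     Ranking features by this score, e.g. with np.argsort(-multiclass_f_score(X, y)), yields the ordering that a filter stage can hand to a wrapper stage; larger values indicate stronger discrimination across all classes.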
     Thirdly, based on the merits and demerits of the Filter and Wrapper feature selection models, a coupled feature selection model is proposed. This model combines IFSFS (Improved F-score and Sequential Forward Search) with SVM (Support Vector Machines): the improved F-score is used as the evaluation criterion, SFS as the search method, and SVM to evaluate the feature subsets selected via the improved F-score. The dermatology (erythemato-squamous diseases) dataset from the UCI repository is then used to test the proposed model. The experimental results demonstrate that the model based on IFSFS and SVM is efficient in diagnosing erythemato-squamous diseases and achieves high classification accuracy.
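     The wrapper stage might look like the following sketch, which assumes scikit-learn's SVC and cross_val_score as stand-ins for the SVM classifier and evaluation protocol described above; the RBF kernel with default parameters is a placeholder rather than the setting used in the reported experiments, and sfs_svm is an illustrative name. Starting from the empty set, the feature whose addition yields the highest cross-validated accuracy is added at each step, and the best subset encountered is returned; the candidate pool can be the features ranked (or pre-filtered) by the improved F-score.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def sfs_svm(X, y, ranked_features, cv=5):
    """Sequential forward search over candidate features, scored by SVM CV accuracy."""
    selected, best_subset, best_score = [], [], 0.0
    remaining = list(ranked_features)
    while remaining:
        # add the candidate whose inclusion gives the highest cross-validated accuracy
        acc, f = max(
            (cross_val_score(SVC(kernel="rbf"), X[:, selected + [f]], y, cv=cv).mean(), f)
            for f in remaining
        )
        selected.append(f)
        remaining.remove(f)
        if acc > best_score:              # remember the best subset seen so far
            best_score, best_subset = acc, list(selected)
    return best_subset, best_score

     A typical call might be sfs_svm(X, y, list(np.argsort(-multiclass_f_score(X, y)))), so that the filter score supplies the candidate pool for the wrapper search.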
     Finally, because of the disadvantage of SFS that once a feature is selected it can never be removed from the selected subset, the thesis proposes another feature selection method based on IFSFFS (Improved F-score and Sequential Floating Forward Search) and SVM. The experimental results on diagnosing erythemato-squamous diseases demonstrate that the method combining IFSFFS and SVM is more efficient and achieves higher classification accuracy.
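     For comparison, a simplified sketch of the floating variant is given below, under the same assumptions as the previous sketch (illustrative name sffs_svm, placeholder SVM settings): after each forward inclusion, a conditional backward step tentatively drops previously selected features and keeps any removal that raises the cross-validated accuracy, so a feature that later becomes redundant can still be discarded.

from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def sffs_svm(X, y, ranked_features, cv=5):
    """Sequential forward floating search with an SVM wrapper."""
    def evaluate(subset):
        return cross_val_score(SVC(kernel="rbf"), X[:, subset], y, cv=cv).mean()

    selected, best_subset, best_score = [], [], 0.0
    remaining = list(ranked_features)
    while remaining:
        # forward step: add the best remaining feature
        acc, f = max((evaluate(selected + [f]), f) for f in remaining)
        selected.append(f)
        remaining.remove(f)
        if acc > best_score:
            best_score, best_subset = acc, list(selected)
        # floating step: undo an earlier inclusion if that improves accuracy
        improved = True
        while improved and len(selected) > 2:
            improved = False
            for g in list(selected):
                trial = [s for s in selected if s != g]
                trial_acc = evaluate(trial)
                if trial_acc > best_score:
                    selected, best_score, best_subset = trial, trial_acc, list(trial)
                    remaining.append(g)
                    improved = True
                    break
    return best_subset, best_score

     The conditional backward step is what distinguishes SFFS from plain SFS and is exactly what addresses the drawback noted above; how the thesis parameterizes the SVM and the cross-validation is not shown here.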
