Research on a Feature Selection Algorithm Based on Maximum Relevance and Minimum Redundancy
Abstract
Feature selection, i.e., choosing an optimal feature subset from the original feature set, is a key problem in pattern recognition. In bioinformatics in particular, where data such as gene expression profiles or protein mass spectra are high-dimensional with few samples, efficient feature selection algorithms are especially important.
     Feature selection is also a prerequisite for designing a high-performance classifier. For example, the computational complexity and training time of a support vector machine grow nonlinearly with the number of training samples and the dimensionality of the input space, so sensible preprocessing of the training set is an important way to improve SVM performance. Selecting features carefully and reducing the feature dimensionality appropriately removes redundancy, speeds up computation, and improves classification efficiency; it also lowers classifier complexity and reduces the classification error rate.
     To address the high algorithmic complexity of feature selection and the difficulty of determining the optimal number of features, this thesis proposes an improved maximum-relevance minimum-redundancy (mRMR) feature selection algorithm, in which the optimal number of features is chosen according to a feature-subset evaluation criterion.
     Building on the theory of mRMR feature selection and combining feature relevance with feature redundancy, the thesis proposes an improved wrapper-type mRMR algorithm. The algorithm accounts for the different roles that relevance and redundancy play in feature selection by introducing a weighting factor that balances the two. Experiments on UCI data sets show that the algorithm effectively removes irrelevant and redundant features, measures the latent redundancy of the feature space, and improves classification accuracy while reducing dimensionality.
Feature selection, i.e., selecting an optimal feature subset from the original feature set, is a key problem in the field of pattern recognition. In bioinformatics, for instance, efficient feature selection algorithms are especially important for small-sample, high-dimensional data such as gene expression and protein mass spectrometry data.
     Feature selection is also necessary for designing a classifier with good performance. For example, the computational complexity and training time of support vector machine classifiers vary nonlinearly with the number of training samples and the dimensionality of the input space, so reasonable preprocessing of the training set is an important way to improve SVM performance. Choosing features effectively and reducing the dimensionality appropriately eliminates redundancy and accelerates computation, improving classification efficiency; it also reduces the complexity of the classifier and the error rate.
     To address the high algorithmic complexity of the feature selection process and the difficulty of determining the optimal number of features, this thesis proposes an improved wrapper-type feature selection algorithm that chooses the optimal number of features based on a feature-subset evaluation criterion.
     Combining feature relevance and redundancy, the thesis builds the improved wrapper-type algorithm on the maximum-relevance minimum-redundancy (mRMR) feature selection framework. The algorithm accounts for the different effects of the degrees of relevance and redundancy in feature selection, and introduces a weighting factor to balance the two. Experiments on UCI data sets show that the algorithm effectively removes irrelevant and redundant features, measures the latent redundancy of the feature space, reduces dimensionality, and improves classification accuracy.
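The greedy mRMR criterion with a balancing weight described above can be sketched as follows. This is a minimal illustrative implementation for discrete features using empirical mutual information; the function names, the greedy selection scheme, and the toy data are assumptions, not the thesis's exact algorithm (the thesis's wrapper additionally evaluates subsets with a classifier).

```python
from collections import Counter
from math import log2

def mutual_information(x, y):
    """Empirical mutual information I(X; Y) in bits for two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    mi = 0.0
    for (a, b), c in pxy.items():
        p_ab = c / n
        mi += p_ab * log2(p_ab / ((px[a] / n) * (py[b] / n)))
    return mi

def mrmr_select(features, labels, k, beta=1.0):
    """Greedily pick k feature indices maximizing
    relevance I(f; labels) - beta * (mean redundancy with already-selected features).
    `beta` is the weighting factor balancing relevance against redundancy."""
    candidates = list(range(len(features)))
    relevance = {i: mutual_information(features[i], labels) for i in candidates}
    selected = [max(candidates, key=relevance.get)]  # seed with the most relevant feature
    candidates.remove(selected[0])
    while len(selected) < k and candidates:
        def score(i):
            redundancy = sum(mutual_information(features[i], features[j])
                             for j in selected) / len(selected)
            return relevance[i] - beta * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy data: f0 matches the labels, f1 is an exact (redundant) copy of f0,
# and f2 is independent of the labels.
feats = [[0, 1, 0, 1, 0, 1, 0, 1],
         [0, 1, 0, 1, 0, 1, 0, 1],
         [0, 0, 1, 1, 0, 0, 1, 1]]
y = [0, 1, 0, 1, 0, 1, 0, 1]
print(mrmr_select(feats, y, k=2, beta=2.0))  # -> [0, 2]: the redundant copy f1 is skipped
```

With a larger `beta`, redundancy is penalized more heavily, so the duplicate feature is passed over in favor of a less redundant one; with `beta = 0` the method degenerates to ranking by relevance alone.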
