Research on Feature Extraction Methods for Biological Data and Their Applications
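Abstract
With the rapid development of high-throughput technologies, research has produced massive amounts of biomedical data. How to uncover biologically meaningful knowledge and patterns from these data is one of the most challenging biological problems of the post-genomic era. Sequence data are growing rapidly, yet the functions of many genes and proteins involved in important life processes remain unknown. Because biological data are inherently complex and different research fields apply different evaluation criteria, it is difficult to discover the functional information of genes and proteins from the raw data alone. Researchers have therefore turned to feature extraction to mine the regularities present in bioinformatics data.
     Feature extraction from biological data is the most basic problem in bioinformatics, and the quality of a feature extraction algorithm directly determines the accuracy of information extraction and analysis for biological data. This thesis focuses on gene data and protein data and studies their feature extraction in depth. Based on the characteristics of each kind of data and its application background, three different feature extraction algorithms are proposed, and their accuracy and reliability are verified and analyzed on standard data sets. The main work is summarized as follows: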
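     (1) Protein feature extraction is the foundation of protein-related applications, and incomplete feature extraction is one of the main factors limiting the effective extraction of protein features. To address this problem, the thesis proposes a sequence feature extraction method based on hybrid features. The method constructs a vector from several kinds of protein sequence feature information and uses it as the protein's feature vector. This feature vector is then fed to an SVM or KNN classifier to predict the subcellular localization of proteins. Compared with other sequence-based subcellular localization methods, the method can predict subcellular localization automatically without prior knowledge of protein structure. The experimental results and the time analysis show that the proposed method is more accurate than several other methods, which supports its correctness and effectiveness.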
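     (2) Most protein feature extraction methods emphasize local information, so the constructed features remain incomplete. To address this problem, the thesis proposes a sequence numerical feature extraction method. It ignores protein structure and interaction information and constructs a vector from properties such as hydrophobicity, polarity, and charge, which serves as the protein's feature vector. The resulting features contain both global and local sequence information. The thesis extracts these feature vectors from protein sequences and combines them with the nearest neighbor classification algorithm (KNN) to predict protein functional classes, addressing functional class prediction for proteins with little or no interaction information. To examine whether subcellular localization information affects protein function prediction, subcellular location information is incorporated into the proposed features and used for function prediction. Experiments show that the results are better than other methods in some respects, which also confirms the effectiveness of the proposed method.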
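     (3) Gene expression data are high-throughput, high-dimensional, nonlinear, noisy, and unevenly distributed, which directly hampers the effective extraction of the information they contain. In view of these characteristics, the thesis proposes a new feature gene selection algorithm. The method combines the filter and wrapper approaches to feature selection: after filtering the raw data, it applies the KNN method to cluster the genes, and then introduces a cluster compactness index to further reduce the dimensionality of the feature genes. To account for interactions among genes, a new feature gene search strategy is introduced into the feature extraction process. The selected feature genes achieve high recognition accuracy while keeping redundancy low. The feature gene selection method is applied to tumor subtype recognition experiments and to key SNP selection experiments. The results show that the proposed method achieves good experimental performance.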
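     As a rough illustration of the sequence-derived feature vectors described in items (1) and (2), the sketch below concatenates amino acid composition with the composition and transition frequencies of residue classes for two physicochemical properties, then classifies with a one-nearest-neighbor rule. It is not the thesis's actual feature set: the residue groupings in PROPERTY_GROUPS, the function names, and the example labels are assumptions chosen only for demonstration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Illustrative three-class residue groupings (an assumption, not the
# thesis's own tables): each property splits the 20 residues into 3 classes.
PROPERTY_GROUPS = {
    "hydrophobicity": ("RKEDQN", "GASTPHY", "CLVIMFW"),
    "charge": ("KR", "DE", "ACFGHILMNPQSTVWY"),
}


def composition(seq):
    """Global feature: fraction of each of the 20 amino acids (non-empty seq)."""
    counts = Counter(seq)
    return [counts[aa] / len(seq) for aa in AMINO_ACIDS]


def group_features(seq, groups):
    """Class composition (3 values) plus class-transition frequencies (3 values)."""
    index = {aa: k for k, members in enumerate(groups) for aa in members}
    classes = [index[aa] for aa in seq if aa in index]
    n = len(classes)
    if n == 0:
        return [0.0] * 6
    comp = [classes.count(k) / n for k in range(3)]
    trans = []
    for a, b in ((0, 1), (0, 2), (1, 2)):
        # Local feature: how often adjacent residues switch between classes a and b.
        hits = sum(1 for x, y in zip(classes, classes[1:]) if {x, y} == {a, b})
        trans.append(hits / (n - 1) if n > 1 else 0.0)
    return comp + trans


def feature_vector(seq):
    """Concatenate the global and local descriptors into one 32-dimensional vector."""
    vec = composition(seq)
    for groups in PROPERTY_GROUPS.values():
        vec += group_features(seq, groups)
    return vec


def nearest_neighbor(query, training):
    """training: list of (vector, label); return the label of the closest vector."""
    return min(training, key=lambda item: math.dist(query, item[0]))[1]
```

     A call such as nearest_neighbor(feature_vector(seq), training) returns the label of the closest training protein; in practice an SVM or a KNN with k > 1 would be trained on a benchmark data set instead.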
With the rapid development of high-throughput technologies, a flood of biomedical data has been produced. One of the most challenging biological problems of the post-genome era is how to extract biologically significant knowledge and patterns from massive biomedical data. Although sequence data are growing rapidly, the functions of many genes and proteins involved in important life activities remain unknown. Because biological data are complex and evaluation criteria differ across research areas, it is difficult to discover the functional information of genes and proteins from the data alone. Researchers have therefore begun to mine the regularities in bioinformatics data by means of feature extraction. Feature extraction is the most fundamental problem in bioinformatics, and the quality of a feature extraction algorithm directly affects the accuracy of information extraction and analysis of biological data. This thesis studies feature extraction from gene and protein data in depth. According to the characteristics of the data and their application background, we propose three different feature extraction algorithms and verify the accuracy and reliability of the methods. The work is summarized as follows:
     (1) Protein feature extraction is the basis of protein-related applications, and incomplete feature extraction is one of the main factors limiting the effective extraction of protein features. To address this problem, this thesis presents a sequence feature extraction method based on mixed features. The method constructs a vector from several kinds of protein sequence feature information and uses it as the protein's feature vector, which then serves as the input of an SVM or KNN classifier to predict the subcellular localization of the protein. Compared with other sequence-based protein subcellular localization methods, the method can make predictions automatically without prior knowledge of protein structure. The experimental results show that the proposed method is more accurate than several other methods, which supports its correctness and effectiveness.
     (2) Most protein feature extraction methods emphasize local information, so the constructed features are still incomplete. To address this problem, this thesis proposes a sequence numerical feature extraction method. The method ignores the protein's structure and interaction information and constructs a vector from hydrophobicity, polarity, charge, and other properties as the feature vector of the protein. The features obtained in this way contain both global and local information about the protein sequence. The extracted feature vectors are combined with the nearest neighbor (KNN) classification algorithm to predict protein functional classes, addressing functional class prediction for proteins with little or no interaction information. To examine whether subcellular localization information affects protein function prediction, subcellular location information is incorporated into the proposed features and used for protein function prediction. Experiments show that the results are better than other methods in some respects, which also confirms the effectiveness of the proposed method.
     (3) This thesis presents a new feature gene selection algorithm and applies it to tumor subtype recognition. The method combines the filter and wrapper approaches to feature selection. First, a filter method is used to reduce the dimension of the gene set; a clustering tightness index is then introduced to further reduce the number of feature genes. Interactions between genes are also taken into account, so that the selected feature gene subset carries highly discriminative classification information with low redundancy and achieves high recognition accuracy. An SVM is used as the classifier, and the experiments are based on four internationally used gene expression data sets. The results show that the method presented in this chapter is superior to some other methods.
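     The two-stage idea in item (3), a filter step followed by redundancy reduction, can be sketched as below. This is a simplified stand-in, not the thesis's algorithm: the filter score is a signal-to-noise ratio for two classes, and the clustering and tightness-index step is replaced by a plain Pearson-correlation check. The function names, the top_k and max_corr parameters, and the toy data in the usage note are all assumptions for illustration.

```python
import math
import statistics


def signal_to_noise(values, labels):
    """Filter score for one gene: class-mean gap divided by summed class spread."""
    a = [v for v, y in zip(values, labels) if y == 0]
    b = [v for v, y in zip(values, labels) if y == 1]
    spread = statistics.pstdev(a) + statistics.pstdev(b)
    return abs(statistics.mean(a) - statistics.mean(b)) / (spread + 1e-12)


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if one is constant)."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((p - mx) * (q - my) for p, q in zip(x, y))
    sx = math.sqrt(sum((p - mx) ** 2 for p in x))
    sy = math.sqrt(sum((q - my) ** 2 for q in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def select_genes(expression, labels, top_k=50, max_corr=0.9):
    """expression maps gene name -> expression values, one per sample.

    Stage 1 (filter): keep the top_k genes by signal-to-noise score.
    Stage 2 (redundancy): walk the ranked list and drop any gene whose
    absolute correlation with an already kept gene reaches max_corr.
    """
    ranked = sorted(
        expression,
        key=lambda g: signal_to_noise(expression[g], labels),
        reverse=True,
    )[:top_k]
    kept = []
    for gene in ranked:
        if all(abs(pearson(expression[gene], expression[k])) < max_corr for k in kept):
            kept.append(gene)
    return kept
```

     For example, with labels [0, 0, 0, 1, 1, 1] and two genes whose profiles are almost identical, only the higher-scoring of the two survives the second stage. A wrapper step would then evaluate candidate subsets with a classifier such as an SVM.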
