Research on Feature Selection Algorithms for Gene Expression Profile Data
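Abstract
Gene chip (microarray) technology is a new molecular biology technique and a major scientific achievement with far-reaching impact. It can quickly and accurately generate large volumes of microarray gene expression profile data, allowing researchers to understand gene expression patterns at the molecular level and to study life phenomena at the microscopic level. Gene expression profile data are characterized by small sample sizes, high dimensionality, heavy noise, many redundant genes, and imbalanced distributions. Reducing the feature dimension with a suitable method and selecting representative feature genes has therefore become an important data-processing task.
     Gene expression profile data have few samples, imbalanced distributions, and heavy noise, and they do not follow a normal model. To address this, two estimators based on robust statistics are proposed. Both take information from the whole sample into account while avoiding excessive reliance on the normality assumption. Experiments show that applying these estimators within the T-statistic algorithm for selecting differentially expressed genes yields good classification results.
     The support vector machine is a classification technique based on structural risk minimization, and the L-J algorithm is a feature selection algorithm derived from the study of support vector machine classification. According to K-L transform theory, any vector can be written as the sum of its projections onto the coordinate axes of an orthogonal space. The improved L-J algorithm therefore replaces the computation of the angles between the gradient vector of the separating hyperplane and each coordinate axis with the components of that gradient vector along each axis, while achieving the same results as the L-J algorithm.
     Gene expression profile data contain many redundant genes, and their presence degrades classification. To address this, a correlation-coefficient-based method is proposed that maps each gene in the expression profile data to a vector in a feature space and then clusters the mapped vectors according to a given rule. Once clustering is complete, one representative vector is chosen from each cluster to form the feature subset. Experiments show that the algorithm reduces the feature dimension and improves classification.
     The genetic algorithm is an intelligent large-scale search algorithm. Taking full account of the characteristics of gene expression profile data, this thesis proposes an improved genetic algorithm for feature selection. The algorithm combines a genetic algorithm, an immune algorithm, a filter method, heuristic methods, and support vector machine classification, and it obtains a smaller feature subset with stronger classification ability.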
Gene microarray technology is a new and highly influential molecular biology technique. Microarrays make it feasible to obtain large amounts of gene expression data, so that gene expression patterns can be understood at the molecular level and biological phenomena studied at the micro level. Such datasets, however, have small sample sizes, high dimensionality, heavy noise, many redundant genes, and uneven distributions. Choosing an appropriate method to reduce the feature dimension and select representative genes is therefore an important preprocessing step.
     Gene expression data have few samples, are unevenly distributed and noisy, and do not follow a normal distribution. This paper proposes two estimators based on robust statistics. They take information from the whole sample into account while avoiding over-dependence on the normal model assumption. Experiments show that better classification accuracy is obtained when these estimators are applied in the T-statistic algorithm for selecting differentially expressed genes.
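The abstract does not say which two robust estimators are used, so the following is only a minimal sketch of the general idea. It assumes the median as the location estimate and the scaled median absolute deviation (MAD) as the scale estimate, and plugs them into a t-like statistic. The names `mad_scale`, `robust_t`, and `rank_genes` are hypothetical and are not taken from the thesis.

```python
import math
import statistics


def mad_scale(values):
    """Median absolute deviation, scaled by 1.4826 so that it is
    comparable to a standard deviation when the data are normal."""
    med = statistics.median(values)
    return 1.4826 * statistics.median([abs(v - med) for v in values])


def robust_t(group_a, group_b, eps=1e-12):
    """t-like statistic with median/MAD in place of mean/standard deviation.

    ASSUMPTION: median and MAD are stand-ins; the thesis's own
    estimators are not specified in the abstract.
    """
    loc_a, loc_b = statistics.median(group_a), statistics.median(group_b)
    s_a, s_b = mad_scale(group_a), mad_scale(group_b)
    denom = math.sqrt(s_a ** 2 / len(group_a) + s_b ** 2 / len(group_b))
    return (loc_a - loc_b) / (denom + eps)  # eps guards against zero scale


def rank_genes(expr, labels):
    """Rank gene indices by |robust_t|, most differentially expressed first.

    expr: one row per gene, one value per sample; labels: 0 or 1 per sample.
    """
    scores = []
    for g, row in enumerate(expr):
        a = [x for x, y in zip(row, labels) if y == 0]
        b = [x for x, y in zip(row, labels) if y == 1]
        scores.append((abs(robust_t(a, b)), g))
    return [g for _, g in sorted(scores, reverse=True)]
```

Because the median and MAD are barely moved by a single extreme value, one outlying sample changes this score far less than it would change an ordinary t-statistic, which matters when each class has only a handful of samples.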
     The support vector machine (SVM) is a classification technique based on structural risk minimization. The L-J algorithm is a feature selection algorithm derived from the study of SVM classification. According to K-L transform theory, any vector can be expressed as the sum of its components in an orthogonal space. The improved algorithm therefore uses the components of the separating hyperplane's gradient vector along each axis instead of computing the angle between the gradient vector and each axis. The method obtains the same effect as the L-J algorithm.
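To see why components and angles can give the same ranking, consider the linear case, where the decision function is f(x) = w·x + b and its gradient is w. The angle θ_j between w and axis j satisfies cos θ_j = |w_j| / ‖w‖, so sorting features by |w_j| in descending order is the same as sorting by θ_j in ascending order. The sketch below illustrates this with a small subgradient-descent linear SVM. It is not the L-J algorithm itself, whose details the abstract does not give, and the function names and hyperparameters are assumptions chosen for illustration.

```python
import math


def train_linear_svm(X, y, lam=0.01, epochs=200, lr=0.1):
    """Full-batch subgradient descent on the regularised hinge loss.

    X: list of feature vectors; y: labels in {-1, +1}.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # only margin violators contribute
                for j, xj in enumerate(xi):
                    grad_w[j] -= yi * xj / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def rank_by_components(w):
    """Rank features by the magnitude of the gradient component on each axis."""
    return sorted(range(len(w)), key=lambda j: abs(w[j]), reverse=True)


def rank_by_angles(w):
    """Rank features by the angle between w and each axis (smallest first)."""
    norm = math.sqrt(sum(wj * wj for wj in w))
    angles = [math.acos(min(1.0, abs(wj) / norm)) for wj in w]
    return sorted(range(len(w)), key=lambda j: angles[j])
```

The component ranking avoids the norm and the inverse cosine entirely, which is the kind of saving the paragraph above describes.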
     Gene expression data contain many redundant genes, and these redundant genes affect the classification results. This paper proposes a method, based on correlation coefficients, that maps each gene to a vector in a feature space and clusters the vectors according to certain rules. After that step, one representative vector is selected from each cluster to compose the feature subset. Experiments show that the algorithm reduces the feature dimension and improves the classification results.
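The abstract leaves the mapping and the clustering rule unspecified. As a rough illustration only, the sketch below assumes a greedy "leader" clustering: a gene joins the first existing cluster whose representative it correlates with above a threshold, and otherwise starts a new cluster with itself as representative. The threshold value, the use of absolute Pearson correlation, and the function names are all assumptions.

```python
import math


def pearson(u, v):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    if su == 0 or sv == 0:
        return 0.0  # a constant profile carries no correlation information
    return cov / (su * sv)


def select_representatives(expr, threshold=0.9):
    """Greedy correlation clustering of genes.

    expr: one row per gene, one value per sample.
    Returns (representative gene indices, {representative: member indices}).
    ASSUMPTION: |r| >= threshold means "redundant", so strongly
    anti-correlated genes are also grouped together.
    """
    representatives = []
    clusters = {}
    for g, row in enumerate(expr):
        for r in representatives:
            if abs(pearson(row, expr[r])) >= threshold:
                clusters[r].append(g)
                break
        else:
            representatives.append(g)
            clusters[g] = [g]
    return representatives, clusters
```

The returned representatives form the reduced feature subset; a classifier would then be trained on those rows only.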
     The genetic algorithm is an intelligent large-scale search algorithm. This paper proposes an improved genetic algorithm for feature selection that takes full account of the characteristics of gene expression data. The algorithm combines a genetic algorithm, an immune algorithm, filtering, a heuristic method, and support vector machine classification. The feature subset obtained with this algorithm is smaller and has stronger classification ability.
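The following is a bare-bones sketch of a genetic-algorithm wrapper for feature selection, meant only to show the overall shape of such a search. It is not the improved algorithm described above: the immune, filter, and heuristic components are omitted, and a leave-one-out nearest-centroid classifier stands in for the SVM so that the example needs only the standard library. The fitness definition, the size penalty, and all parameter values are assumptions.

```python
import random


def nearest_centroid_accuracy(X, y, mask):
    """Leave-one-out accuracy of a nearest-centroid classifier on the
    features selected by mask. Each class needs at least two samples."""
    cols = [j for j, bit in enumerate(mask) if bit]
    correct = 0
    for i in range(len(X)):
        centroids = {}
        for label in set(y):
            rows = [X[k] for k in range(len(X)) if k != i and y[k] == label]
            centroids[label] = [sum(r[j] for r in rows) / len(rows) for j in cols]
        point = [X[i][j] for j in cols]
        predicted = min(
            centroids,
            key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])),
        )
        correct += predicted == y[i]
    return correct / len(X)


def fitness(X, y, mask, size_penalty=0.1):
    """Accuracy minus a penalty on the fraction of features kept."""
    if not any(mask):
        return -1.0  # an empty subset cannot classify anything
    return nearest_centroid_accuracy(X, y, mask) - size_penalty * sum(mask) / len(mask)


def ga_select(X, y, pop_size=20, generations=30, mutation_rate=0.1, seed=0):
    """Evolve 0/1 feature masks with elitism, one-point crossover, bit-flip mutation."""
    rng = random.Random(seed)
    n = len(X[0])
    # Seed the population with the full feature set plus random masks.
    population = [[1] * n] + [
        [rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size - 1)
    ]
    for _ in range(generations):
        ranked = sorted(population, key=lambda m: fitness(X, y, m), reverse=True)
        next_gen = [ranked[0][:]]          # elitism: keep the best mask unchanged
        parents = ranked[: pop_size // 2]  # truncation selection
        while len(next_gen) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randint(1, n - 1)
            child = a[:cut] + b[cut:]
            child = [bit ^ 1 if rng.random() < mutation_rate else bit for bit in child]
            next_gen.append(child)
        population = next_gen
    return max(population, key=lambda m: fitness(X, y, m))
```

Because the best mask is carried over unchanged in every generation and the initial population contains the all-features mask, the returned subset never scores worse under `fitness` than using every feature.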
