Application of a Multi-Feature Ensemble Classifier to Gene Expression Data Classification
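Abstract
With the progress of the Human Genome Project, DNA microarray technology emerged as a revolutionary technique. It can measure the expression of thousands of genes automatically, quickly, and efficiently. By analyzing the resulting gene expression data, one can understand the physiological state of cells at the molecular level, such as survival, proliferation, differentiation, apoptosis, carcinogenesis, and stress response. This understanding is important for clinical diagnosis, the assessment of drug efficacy, and the explanation of disease mechanisms.
     Gene expression data are massive in volume and extremely complex, and they are hard to interpret directly with medical imaging methods. Gene expression data classification has therefore become a very difficult problem in bioinformatics. Early work often used pattern recognition methods, relying on the computing power of machines, and achieved some results. In recent years, as machine learning algorithms have become widely used in bioinformatics, a number of researchers have proposed them as a new approach to gene expression data classification. Unfortunately, because gene expression data have few samples, many features, and nonlinear structure, applying machine learning directly remains difficult, for two main reasons: 1. With too many features, the important ones are masked by the many irrelevant ones, which makes it hard for a classifier to learn. 2. With too few samples, most classifiers overfit. The many-features problem is usually addressed by extracting feature genes from the raw data to reduce dimensionality; the few-samples problem is commonly addressed by classifier ensembles, which strengthen the learning ability of individual classifiers and so improve classification accuracy.
     For a good gene expression data classification system, feature gene selection and classifier ensemble construction are two indispensable steps. In practice, however, the two steps are often carried out in isolation: the first does not lay a good foundation for the second, and may even reduce the classification accuracy of the overall system.
     Drawing on the strengths and weaknesses of commonly used earlier methods, this work combines feature gene selection with classifier ensembles and proposes an ensemble classifier method based on multiple features. The idea of the algorithm is as follows. First, different feature gene extraction algorithms, such as correlation analysis, the Golub method, and the t-test, are applied to the data, yielding multiple feature subsets of the samples. Then, sampling with replacement is used to draw samples from the different feature subsets to form training subsets. Because the training subsets are drawn from different feature subsets, they are more diverse. A group of neural networks is then trained on these training subsets; to keep the networks from getting trapped in local optima, training uses particle swarm optimization (PSO). Finally, following the selective ensemble idea that "many could be better than all", an estimation of distribution algorithm (EDA) selects the best neural network classifiers for the ensemble, which makes the final classification decision.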
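The feature-subset and resampling steps described above can be illustrated with a minimal Python sketch. It is not the author's implementation: the function names, the two-class 0/1 labels, the top-k cutoff `k`, the number of bootstrap draws per method, and the exact score formulas (a Golub-style signal-to-noise ratio, a Welch t-statistic, and a Pearson correlation with the class label) are all assumptions chosen for illustration. Plain lists are used so that the sketch has no dependencies.

```python
import random
from math import sqrt
from statistics import mean, stdev

def golub_score(a, b):
    """Golub-style signal-to-noise ratio for one gene (assumed form)."""
    return abs(mean(a) - mean(b)) / (stdev(a) + stdev(b) + 1e-12)

def t_score(a, b):
    """Absolute Welch t-statistic for one gene."""
    se = sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    return abs(mean(a) - mean(b)) / (se + 1e-12)

def correlation_score(a, b):
    """Absolute Pearson correlation between one gene and the 0/1 class label."""
    values = a + b
    labels = [0.0] * len(a) + [1.0] * len(b)
    mv, ml = mean(values), mean(labels)
    cov = sum((v - mv) * (c - ml) for v, c in zip(values, labels))
    var_v = sum((v - mv) ** 2 for v in values)
    var_l = sum((c - ml) ** 2 for c in labels)
    return abs(cov) / (sqrt(var_v * var_l) + 1e-12)

def top_genes(X, y, score, k):
    """Indices of the k highest-scoring genes; X is a samples-by-genes table."""
    scores = []
    for g in range(len(X[0])):
        a = [row[g] for row, label in zip(X, y) if label == 0]
        b = [row[g] for row, label in zip(X, y) if label == 1]
        scores.append(score(a, b))
    return sorted(range(len(scores)), key=lambda g: scores[g], reverse=True)[:k]

def build_training_subsets(X, y, k=20, n_per_method=3, seed=0):
    """One feature subset per scoring method, then bootstrap rows inside each."""
    rng = random.Random(seed)
    subsets = []
    for score in (correlation_score, golub_score, t_score):
        genes = top_genes(X, y, score, k)
        for _ in range(n_per_method):
            rows = [rng.randrange(len(y)) for _ in y]  # sample with replacement
            Xs = [[X[r][g] for g in genes] for r in rows]
            ys = [y[r] for r in rows]
            subsets.append((genes, Xs, ys))
    return subsets
```

Each returned tuple holds the selected gene indices together with one resampled training set, so a base classifier trained on it sees both a different gene subset and a different sample draw. A real pipeline would also need to guard against bootstrap draws that contain only one class.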
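     To validate the method, classification experiments were run on four widely used gene expression datasets: Leukemia, Colon, Ovarian, and Lung Cancer. The results show that the proposed method achieves higher classification accuracy and stability than the other methods.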
With the development of the Human Genome Project, DNA microarray technology has emerged as a revolutionary technique. It can measure the expression of tens of thousands of genes automatically, rapidly, and efficiently. By analyzing the resulting gene expression data, we can understand the physiological state of cells at the molecular level, such as survival, proliferation, differentiation, apoptosis, carcinogenesis, and stress response. This understanding plays an important role in clinical diagnosis, the assessment of drug efficacy, and the explanation of disease mechanisms.
     Gene expression data are very complex and enormous in volume, and they are hard to interpret directly with medical imaging methods. Gene expression data classification has thus become one of the toughest problems in bioinformatics. Early work often employed pattern recognition methods and, with the help of powerful computers, achieved some results. In recent years, as machine learning algorithms have become widely used in bioinformatics, they have been proposed as a new approach to gene expression data classification. However, because gene expression data have few samples, a very large number of features, and nonlinear structure, it is difficult to apply these methods directly. This is mainly because: 1. Important features are masked by the many irrelevant ones, which makes them hard for classifiers to learn. 2. Too few samples cause the classifier to overfit. To address the first problem, feature selection methods are often applied to reduce dimensionality. For the second problem, classifier ensembles are usually used to increase classification accuracy.
     For a good gene expression data classification system, feature gene selection and classifier ensemble construction are two essential steps. In practical applications, however, these two steps are often carried out in isolation. The first step may fail to provide a good foundation for the second, and may even reduce the overall classification accuracy.
     In this paper, a novel ensemble of classifiers based on multiple features is proposed. The method combines feature gene selection with classifier ensembles. The algorithm proceeds as follows. First, to extract useful features and reduce dimensionality, different feature selection methods, such as correlation analysis and the Fisher ratio, are used to form different feature subsets. Then a pool of candidate base classifiers is generated and trained with the PSO (Particle Swarm Optimization) algorithm on training subsets resampled from the different feature subsets. Finally, following the selective-ensemble idea that "many could be better than all", appropriate classifiers are selected with an EDA (Estimation of Distribution Algorithm) to form the classification committee.
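The final selection step can be illustrated with a small sketch of an estimation of distribution algorithm over binary masks, where each bit says whether one base classifier joins the committee. The abstract does not say which EDA variant is used, so the univariate (UMDA-style) model here is an assumption, as are the function names, the majority-vote fitness on held-out predictions, the tie rule, and all parameter values.

```python
import random

def vote_accuracy(mask, preds, y):
    """Accuracy of the majority vote of the classifiers switched on in mask.

    preds[i][j] is the 0/1 prediction of classifier i on validation sample j.
    """
    chosen = [p for keep, p in zip(mask, preds) if keep]
    if not chosen:
        return 0.0
    correct = 0
    for j, label in enumerate(y):
        ones = sum(p[j] for p in chosen)
        vote = 1 if 2 * ones > len(chosen) else 0  # ties go to class 0 (assumed rule)
        correct += (vote == label)
    return correct / len(y)

def eda_select(preds, y, pop_size=40, n_elite=10, generations=30, seed=0):
    """UMDA-style EDA: sample masks from per-classifier inclusion probabilities,
    keep the fittest masks, and re-estimate the probabilities from them."""
    rng = random.Random(seed)
    n = len(preds)
    probs = [0.5] * n
    # Start from the full ensemble so the result is never worse than "all".
    best_mask, best_fit = [1] * n, vote_accuracy([1] * n, preds, y)
    for _ in range(generations):
        pop = [[1 if rng.random() < p else 0 for p in probs] for _ in range(pop_size)]
        pop.sort(key=lambda m: vote_accuracy(m, preds, y), reverse=True)
        elite = pop[:n_elite]
        top_fit = vote_accuracy(elite[0], preds, y)
        if top_fit > best_fit:
            best_mask, best_fit = elite[0], top_fit
        # Re-estimate the marginals; clamp them so no bit becomes permanently fixed.
        probs = [min(0.95, max(0.05, sum(m[i] for m in elite) / n_elite))
                 for i in range(n)]
    return best_mask, best_fit
```

Calling `eda_select(preds, y)` with the validation predictions of the trained base classifiers returns the selected mask and its vote accuracy. Seeding the search with the full ensemble is a design choice of this sketch: it makes the selected committee at least as accurate on the validation data as voting with every classifier.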
     Four commonly used datasets, namely Leukemia, Colon, Ovarian, and Lung Cancer, were used to test the method. Experiments show that the proposed method achieves higher classification accuracy and stability than the other methods.
