Research on Classification of Tumor Gene Expression Data Based on Manifold Learning
Abstract
Tumors are among the major diseases affecting human health, yet current methods of tumor diagnosis and the outcomes of treatment remain unsatisfactory. Molecular diagnosis based on gene expression profiles is a new approach that is fast and accurate. It can also detect tumor progression, degree of malignancy, and resistance to anticancer drugs, giving clinicians an important reference for diagnosing tumor subtypes, planning treatment, and analyzing prognosis. Microarray data, characterized by high dimensionality and small sample size, continue to accumulate, and extracting useful information or regularities from such high-dimensional data effectively has become one of the pressing problems in information science and technology.
     Selecting, from the thousands of genes in a gene expression profile, a small set of feature genes with strong classification power is highly complex. An exhaustive search over such a large gene space is usually infeasible, so choosing a suitable feature extraction method is very important.
     In this thesis, building on a review of existing manifold learning algorithms, we apply a new feature extraction method together with several manifold learning algorithms to two-class and multi-class classification problems and compare the results. Finally, we use Constrained Maximum Variance Mapping (CMVM) and Locally Linear Discriminant Embedding (LLDE) in a comparative classification study on cross-platform tumor data.
     The main work of this thesis is as follows. First, a feature extraction method for tumor gene expression data, Constrained Maximum Variance Mapping (CMVM), is applied to extract gene features from tumor samples, which are then classified with a K-NN classifier. In the two-class experiments, we perform feature extraction and analyze recognition rates on the prostate cancer and breast cancer datasets; in the multi-class experiments, we do the same on the leukemia and central nervous system tumor datasets. These experiments on different tumor samples verify the feasibility and effectiveness of the method. Second, manifold learning algorithms are applied to feature extraction from cross-platform tumor gene expression data, and the extracted features are classified with a K-NN classifier in order to compare the recognition performance of the algorithms.
     Finally, the thesis points out some open problems in feature extraction and classification of tumor gene expression data and outlines research that remains to be carried out.
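     The classification stage described in the abstract (K-NN applied to low-dimensional features, scored by recognition rate) can be illustrated with a minimal sketch. It is not the thesis implementation: CMVM and LLDE are not reproduced here, and the value of k, the Euclidean distance, the leave-one-out protocol, the function names, and the toy data are all assumptions made for illustration.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Rank training indices by Euclidean distance to the query (assumed metric).
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    # most_common(1) returns the label with the highest vote count.
    return votes.most_common(1)[0][0]


def leave_one_out_rate(xs, ys, k=3):
    """Fraction of samples classified correctly when each is held out in turn."""
    correct = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if knn_predict(train_x, train_y, xs[i], k) == ys[i]:
            correct += 1
    return correct / len(xs)


# Toy 2-D points standing in for features already extracted by a
# dimensionality-reduction step; made up for illustration, not thesis data.
features = [(0.1, 0.2), (0.0, 0.4), (0.3, 0.1), (2.0, 2.1), (2.2, 1.9), (1.8, 2.3)]
labels = ["normal", "normal", "normal", "tumor", "tumor", "tumor"]

rate = leave_one_out_rate(features, labels, k=3)
print(f"leave-one-out recognition rate: {rate:.2f}")
```

     In a real experiment the `features` list would be replaced by the embedding produced by the chosen feature extraction method, and the same scoring loop could be reused to compare methods on equal terms.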