Research on Classification Methods for Tumor Gene Expression Profiles
Abstract
Tumor classification research builds on DNA microarray technology: gene expression is measured across different tumor samples in order to identify differentially expressed genes and to relate differential expression to pathological findings. Although the pattern recognition field offers many classification algorithms, a number of problems in the classification process remain open. Because gene expression profile data are high-dimensional with few samples, traditional classification methods struggle to achieve good results on such data, and they are computationally expensive and inefficient.
     This thesis proposes three classification algorithms for tumor data. The main research contents are as follows:
     1. Histogram theory is applied to the classification of tumor gene expression profile data. First, the information entropy of each gene is computed and redundant genes are removed according to their entropy values. Next, histogram statistics are computed for the gene expression data, and the genes with the largest peak-valley difference and peak-valley ratio are selected as feature genes. Finally, classification experiments are carried out with Support Vector Machine (SVM) and K-Nearest-Neighbor (KNN) classifiers.
     2. Non-negative matrix factorization and Normal Matrix spectral decomposition theory are applied to the classification of tumor gene expression profile data. First, the fdr_test scoring criterion is used to coarsely remove noise genes, which gives a preliminary dimensionality reduction of the expression data. Non-negative matrix factorization then extracts composite attributes of the genes, and a Normal Matrix between samples is constructed from these attributes. Finally, singular value decomposition of the Normal Matrix yields spectral components that characterize the class attributes of the samples, and these components are used to identify the tumor type.
     3. A classification algorithm for tumor gene expression profile data based on Principal Component Analysis (PCA) and the minimum spanning tree is proposed. First, PCA reduces the dimensionality of the gene expression data. The tumor samples are then mapped to points in a high-dimensional space and their adjacency matrix is constructed. From the adjacency matrix, an undirected complete graph over the samples is built and its minimum spanning tree is generated. Deleting the longest edge of the tree splits it into two subtrees, one corresponding to the normal samples and the other to the tumor samples.
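The graph step of the third method can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the thesis: it assumes the samples have already been reduced with PCA, it uses Euclidean distance as the edge weight (the abstract does not name the metric), and the function name `mst_two_way_split` and the toy data are hypothetical.

```python
import math


def mst_two_way_split(points):
    """Split points into two groups by cutting the longest MST edge.

    points: list of equal-length coordinate tuples (assumed here to be
    PCA-reduced sample vectors). Returns a list of 0/1 group labels.
    """
    n = len(points)
    if n < 2:
        raise ValueError("need at least two samples")

    # Adjacency (distance) matrix of the undirected complete graph.
    # Euclidean distance is an assumption of this sketch.
    dist = [[math.dist(p, q) for q in points] for p in points]

    # Prim's algorithm: grow the tree starting from sample 0.
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known edge weight into the tree
    parent = [-1] * n       # tree endpoint of that cheapest edge
    best[0] = 0.0
    edges = []              # (weight, u, v) for each MST edge
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((best[u], parent[u], u))
        for v in range(n):
            if not in_tree[v] and dist[u][v] < best[v]:
                best[v] = dist[u][v]
                parent[v] = u

    # Delete the longest edge, then flood-fill the subtree containing sample 0.
    edges.remove(max(edges))
    neighbours = {i: [] for i in range(n)}
    for _, u, v in edges:
        neighbours[u].append(v)
        neighbours[v].append(u)
    labels = [1] * n
    labels[0] = 0
    stack = [0]
    while stack:
        u = stack.pop()
        for v in neighbours[u]:
            if labels[v] == 1:
                labels[v] = 0
                stack.append(v)
    return labels


# Toy data: two well-separated clusters in a 2-D (already reduced) space.
samples = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9), (4.9, 5.3)]
labels = mst_two_way_split(samples)
```

The cut only separates the samples into two groups. Deciding which subtree is "normal" and which is "tumor" needs outside information, such as a few samples with known labels.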
