Research on Microarray Data Mining Techniques
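Abstract
Microarrays, a new molecular biology technology, can measure the expression levels of several thousand genes in a biological sample simultaneously. This experimental technique yields genome-wide gene expression data, making it possible to uncover intrinsic, previously unknown, and meaningful biological knowledge. The main challenge in this field is to develop bioinformatics tools for collecting and analyzing the data.
     This dissertation studies several major problems in microarray data mining, including gene selection, tissue classification, and the reconstruction of regulatory networks from gene expression data. The main work is summarized as follows:
     Gene sets selected from microarray data by common ranking methods often contain highly correlated genes, which degrades classifier performance. To remove these redundant genes (features), an unsupervised feature selection algorithm is proposed. The algorithm has two main steps: partitioning the original feature set into groups of similar features (clusters), and selecting a representative feature from each cluster. The partitioning follows the k-nearest-neighbor principle, with the correlation between features as the similarity measure. The algorithm does not require the number of clusters to be specified and has low time complexity. Experiments on real biological data show that it significantly improves classification accuracy.
     The main challenge in supervised classification of tissue samples from microarray data is that the number of genes far exceeds the number of samples. To address this, a classification method based on an ensemble of artificial neural networks is proposed. The method uses the Wilcoxon test to select the genes that matter for classification, trains each member of the ensemble on data generated by the convex pseudo-data method, and combines the members' outputs by simple averaging. Experiments on real biological data show that the method outperforms a single neural network, the nearest-neighbor method, and decision trees.
     A Bayesian network is a graphical model of the joint probability distribution of multiple variables that captures the conditional independence relationships among them. It has attracted attention because it can represent the complex stochastic processes of gene expression. This dissertation compares two Bayesian network learning methods, hill climbing and Markov chain Monte Carlo (MCMC), on simulated microarray data. The results show that MCMC outperforms hill climbing. Under realistic microarray data conditions, however, Bayesian networks determine the relationships between gene pairs no better than chance.
     Mining microarray data makes it possible to discover causal relationships in gene regulatory pathways. A constraint-based causal discovery method is proposed to search for potential causal relationships between genes. The search uses the 300 yeast expression profiles published by Hughes et al. and finds a number of causal relationships. A cursory analysis shows that some of them are biologically meaningful, while others await further study. These results indicate that the method is feasible and can find meaningful causal structures.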
Microarrays, a new molecular biology technology, make it feasible to measure quantitatively and simultaneously the expression of thousands of genes in a biological sample. The genome-wide expression data generated by this technology promise to uncover implicit, previously unknown, and potentially meaningful biological knowledge. A major challenge in this area is to develop bioinformatics tools for data collection and analysis.
    This dissertation investigates several problems in microarray data mining, including gene selection, tissue classification, and genetic network construction from gene expression data. The main contributions are summarized below:
    Gene sets selected from microarray data by the usual ranking methods typically contain many highly correlated genes, which degrades classifier performance. To filter out these redundant genes (features), an unsupervised feature selection algorithm is proposed. The algorithm has two steps: partitioning the original feature set into a number of homogeneous subsets (clusters), and selecting a representative feature from each cluster. The features are partitioned according to the k-NN (k-nearest-neighbor) principle, using pairwise feature correlation as the similarity measure. The method does not need the number of clusters to be specified in advance, and its computational complexity is low. Experiments on real biological data show that the algorithm significantly increases the classification accuracy of existing classifiers.
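The two steps above can be illustrated with a small sketch. It is a simplified reading of the k-NN idea (in the spirit of Mitra et al., 2002, listed in the references), not the dissertation's actual algorithm: the dissimilarity `1 - |r|`, the function names, and the greedy "tightest neighbourhood first" rule are all assumptions made for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def select_representatives(features, k):
    """features maps a gene name to its expression values across samples.

    Repeatedly keep the gene whose k-th nearest neighbour is closest
    (the tightest neighbourhood) and discard those k neighbours as redundant.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    names = sorted(features)
    # Assumed dissimilarity: 1 - |Pearson r|, so strongly (anti)correlated genes are "close".
    dist = {(a, b): 1.0 - abs(pearson(features[a], features[b]))
            for a in names for b in names if a != b}
    remaining = set(names)
    selected = []
    while remaining:
        if len(remaining) == 1:
            selected.append(remaining.pop())
            break
        kk = min(k, len(remaining) - 1)
        best = None  # (radius, gene, neighbours)
        for f in sorted(remaining):
            nbrs = sorted((g for g in remaining if g != f),
                          key=lambda g: (dist[(f, g)], g))[:kk]
            radius = dist[(f, nbrs[-1])]
            if best is None or radius < best[0]:
                best = (radius, f, nbrs)
        _, gene, nbrs = best
        selected.append(gene)
        remaining -= {gene, *nbrs}
    return selected
```

The number of retained genes is not passed in; it falls out of `k` and the correlation structure, which matches the claim that the cluster count need not be fixed in advance. This naive version precomputes all pairwise correlations, so it is only practical for a pre-filtered gene list.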
    Accurate supervised classification of tissue samples from large-scale gene expression data is challenging because the number of genes far exceeds the number of samples. A classification method based on artificial neural network ensembles is therefore proposed. In this method, the genes significant for classification are selected by the Wilcoxon test. Each member of the ensemble is trained on a different dataset generated by the convex pseudo-data method, and the predictions of the individual networks are combined by simple averaging. Experiments on real biological data show that this method outperforms single neural networks, 1-nearest-neighbor classifiers, and decision trees.
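A minimal sketch of the ensemble scheme just described, under stated assumptions. The pseudo-data step follows one common reading of Breiman (1999): a pseudo-sample is a convex combination of two training samples and keeps the label of the first. The base learner here is a nearest-centroid scorer standing in for a neural network, purely to keep the example short; the Wilcoxon gene selection step is omitted, and all function names and the parameter `d` are illustrative.

```python
import math
import random

def convex_pseudo_data(X, y, d, rng):
    """Build a pseudo-dataset of the same size as (X, y).

    Each pseudo-sample is (1 - v) * x_i + v * x_j for two randomly drawn
    training samples and v ~ Uniform(0, d); it inherits the label of x_i.
    """
    n = len(X)
    Xp, yp = [], []
    for _ in range(n):
        i, j = rng.randrange(n), rng.randrange(n)
        v = rng.uniform(0.0, d)
        Xp.append([(1 - v) * a + v * b for a, b in zip(X[i], X[j])])
        yp.append(y[i])
    return Xp, yp

def fit_centroid_scorer(X, y):
    """Stand-in base learner (NOT a neural network): returns a score in [0, 1] for class 1."""
    def centroid(label):
        rows = [x for x, t in zip(X, y) if t == label]
        return [sum(col) / len(rows) for col in zip(*rows)]
    c0, c1 = centroid(0), centroid(1)
    def score(x):
        d0, d1 = math.dist(x, c0), math.dist(x, c1)
        return 0.5 if d0 + d1 == 0 else d0 / (d0 + d1)
    return score

def train_ensemble(X, y, members, d, rng):
    """Train each member on its own pseudo-dataset; labels must be 0/1 with both present."""
    ensemble = []
    while len(ensemble) < members:
        Xp, yp = convex_pseudo_data(X, y, d, rng)
        if len(set(yp)) == 2:  # skip draws in which one class vanished
            ensemble.append(fit_centroid_scorer(Xp, yp))
    return ensemble

def predict(ensemble, x):
    """Simple averaging: mean of the member scores, thresholded at 0.5."""
    avg = sum(score(x) for score in ensemble) / len(ensemble)
    return 1 if avg >= 0.5 else 0
```

Because every member sees a different pseudo-dataset, the members disagree slightly, and averaging their scores reduces the variance of the final decision. That is the usual motivation for ensembles when samples are scarce.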
    
    
    A Bayesian network is a graphical model of a joint multivariate probability distribution that captures conditional independence properties between variables. Such models are attractive for their ability to describe the complex stochastic processes of gene expression. We compared the hill-climbing method and the Markov chain Monte Carlo (MCMC) method for learning Bayesian networks from simulated microarray data. Our analysis suggests that MCMC performs better than hill climbing. However, under realistic microarray data conditions, we find that Bayesian networks perform no better than chance at determining whether a regulatory connection exists between a pair of genes.
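To make the contrast between the two search strategies concrete, here is a hedged sketch of both over the space of DAGs. The scoring function is supplied by the caller (in practice a log posterior or a score such as BDe; the toy score in the usage note is an assumption). The Metropolis sampler is simplified: it omits the Hastings correction for unequal neighbourhood sizes, so it does not sample exactly from the posterior. All names are illustrative.

```python
import math
import random
from itertools import permutations

def is_acyclic(nodes, edges):
    """Kahn's algorithm: True if the directed graph has no cycle."""
    indegree = {n: 0 for n in nodes}
    for _, child in edges:
        indegree[child] += 1
    ready = [n for n in nodes if indegree[n] == 0]
    visited = 0
    while ready:
        node = ready.pop()
        visited += 1
        for parent, child in edges:
            if parent == node:
                indegree[child] -= 1
                if indegree[child] == 0:
                    ready.append(child)
    return visited == len(nodes)

def neighbours(nodes, edges):
    """All DAGs one edge addition, deletion, or reversal away from `edges` (a frozenset)."""
    result = []
    for a, b in permutations(nodes, 2):
        if (a, b) in edges:
            without = edges - {(a, b)}
            result.append(without)                 # deletion
            flipped = without | {(b, a)}
            if is_acyclic(nodes, flipped):
                result.append(flipped)             # reversal
        elif (b, a) not in edges:
            added = edges | {(a, b)}
            if is_acyclic(nodes, added):
                result.append(added)               # addition
    return result

def hill_climb(nodes, score, start=frozenset()):
    """Greedy search: move to the best neighbour until none improves the score."""
    current, current_score = start, score(start)
    while True:
        best = max(neighbours(nodes, current), key=score)
        if score(best) <= current_score:
            return current
        current, current_score = best, score(best)

def metropolis(nodes, score, steps, rng, start=frozenset()):
    """Simplified Metropolis walk; returns how often each edge appeared along the chain."""
    current = start
    counts = {}
    for _ in range(steps):
        proposal = rng.choice(neighbours(nodes, current))
        accept = math.exp(min(0.0, score(proposal) - score(current)))
        if rng.random() < accept:
            current = proposal
        for edge in current:
            counts[edge] = counts.get(edge, 0) + 1
    return {edge: c / steps for edge, c in counts.items()}
```

Hill climbing returns a single structure and can stall in a local optimum, whereas the sampler returns edge frequencies averaged over many structures. With a toy score such as `lambda e: -5.0 * len(e ^ target)` for a known `target` edge set, both recover the target; on real data the two can differ substantially.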
    There is great potential for mining microarray databases to discover causal relationships in gene-regulation pathways. A constraint-based causal discovery method is presented to search for the underlying causal relationships between genes. The search uses the published dataset of Hughes et al., comprising 300 expression profiles for yeast, and finds a number of causal relationships. A cursory analysis shows that some of these relationships are biologically sensible, while others suggest new hypotheses that may deserve further investigation. The results indicate that the proposed approach is computationally feasible and succeeds in identifying interesting causal structures.
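The abstract does not say which constraint-based rule the dissertation uses. One well-known rule of this kind is the LCD test of Cooper (1997), listed in the references: given a variable W assumed to have no measured causes, conclude that X causes Y if W and X are dependent, X and Y are dependent, and W and Y are independent given X. The sketch below illustrates that rule only. The fixed correlation threshold is a crude stand-in for a proper statistical test, and all names are illustrative.

```python
import math
import random
from itertools import permutations

def corr(x, y):
    """Pearson correlation (0.0 if either sequence is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def partial_corr(x, y, z):
    """Correlation of x and y after removing the linear effect of z."""
    rxy, rxz, ryz = corr(x, y), corr(x, z), corr(y, z)
    denom = math.sqrt(max(0.0, 1 - rxz ** 2) * max(0.0, 1 - ryz ** 2))
    return 0.0 if denom == 0 else (rxy - rxz * ryz) / denom

def lcd(data, exogenous, threshold=0.2):
    """Return ordered pairs (x, y) for which the LCD pattern W -> X -> Y holds.

    data maps a variable name to its measurements; `exogenous` names W,
    which is ASSUMED to have no causes among the measured variables.
    """
    w = data[exogenous]
    others = sorted(name for name in data if name != exogenous)
    found = []
    for x, y in permutations(others, 2):
        dependent_wx = abs(corr(w, data[x])) > threshold
        dependent_xy = abs(corr(data[x], data[y])) > threshold
        independent_wy_given_x = abs(partial_corr(w, data[y], data[x])) <= threshold
        if dependent_wx and dependent_xy and independent_wy_given_x:
            found.append((x, y))
    return found
```

On data simulated as a chain W -> X -> Y with Gaussian noise, the rule reports X -> Y and rejects Y -> X, because W remains correlated with X even after conditioning on Y. Real expression data would need a principled independence test and correction for the very large number of gene pairs examined.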
References
Akutsu, T., Miyano, S., and Kuhara, S. (1999). Identification of genetic networks from a small number of gene expression patterns under the Boolean network model. Pac. Symp. Biocomput. 17-28.
    Akutsu, T., Miyano, S., and Kuhara, S. (2000a). Algorithms for identifying Boolean networks and related biological networks based on matrix multiplication and fingerprint function. J. Comput. Biol. 7, 331-343.
    Akutsu, T., Miyano, S., and Kuhara, S. (2000b). Inferring qualitative relations in genetic networks and metabolic pathways. Bioinformatics. 16, 727-734.
    Alizadeh, A. A., Eisen, M. B., Davis, R. E., Ma, C., Lossos, I. S., Rosenwald, A., Boldrick, J. C., Sabet, H., Tran, T., Yu, X., Powell, J. I., Yang, L., Marti, G. E., Moore, T., Hudson, J., Jr., Lu, L., Lewis, D. B., Tibshirani, R., Sherlock, G., Chan, W. C., Greiner, T. C., Weisenburger, D. D., Armitage, J. O., Warnke, R., Levy, R., Wilson, W., Grever, M. R., Byrd, J. C., Botstein, D., Brown, P. O., and Staudt, L. M. (2000). Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403, 503-511.
    Alon, U., Barkai, N., Notterman, D. A., Gish, K., Ybarra, S., Mack, D., and Levine, A. J. (1999). Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. PNAS 96, 6745-6750.
    Anand, R., Mehrotra, K., Mohan, C. K., and Ranka, S. (1995). Efficient classification for multiclass problems using modular neural networks. IEEE Transactions on Neural Networks 6, 117-124.
    Ben Dor, A., Bruhn, L., Friedman, N., Nachman, I., Schummer, M., and Yakhini, Z. (2000a). Tissue classification with gene expression profiles. J. Comput. Biol. 7, 559-583.
    Ben Dor, A., Friedman, N., and Yakhini, Z. (2000b). Scoring genes for relevance. AGL-2000-13.
    Ben Dor, A., Shamir, R., and Yakhini, Z. (1999). Clustering gene expression patterns. J. Comput. Biol. 6, 281-297.
    Bicciato, S., Luchini, A., and Di Bello, C. (2003). PCA disjoint models for multiclass cancer analysis using gene expression data. Bioinformatics 19, 571-578.
    Breiman, L. (1999). Using convex pseudo-data to increase prediction accuracy. Technical Report 513, Statistics Department, U. C. Berkeley, USA.
    
    
    Breiman, L. (1996a). Bagging predictors. Machine Learning 24, 123-140.
    Breiman, L. (1996b). Heuristics of instability and stabilization in model selection. Annals of Statistics 24, 2350-2383.
    Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984). Classification and Regression Trees. (CA: Wadsworth International Group).
    Butte, A. J. and Kohane, I. S. (2000). Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements. Pac. Symp. Biocomput. 418-429.
    Carr, D. B., Somogyi, R., and Michaels, G. (1997). Templates for looking at gene expression clustering. Statistical Computing & Statistical Graphics Newsletter 8, 20-29.
    Chee, M., Yang, R., Hubbell, E., Berno, A., Huang, X. C., Stern, D., Winkler, J., Lockhart, D. J., Morris, M. S., and Fodor, S. P. (1996). Accessing genetic information with high-density DNA arrays. Science 274, 610-614.
    Chen, T., He, H. L., and Church, G. M. (1999). Modeling gene expression with differential equations. Pac. Symp. Biocomput. 29-40.
    Cheng, Y. Q., Zhuang, Y. M., and Yang, J. Y. (1992). Optimal fisher discriminant analysis using the rank decomposition. Pattern Recognition 25, 101-111.
    Chickering, D. M. (1995). A transformational characterization of equivalent Bayesian network structures. UAI'95 87-98.
    Chickering, D. M. (1996). Learning Bayesian networks is NP-complete. "Learning from Data: Artificial Intelligence and Statistics Ⅴ", Springer Verlag.
    Cho, R. J., Campbell, M. J., Winzeler, E. A., Steinmetz, L., Conway, A., Wodicka, L., Wolfsberg, T. G., Gabrielian, A. E., Landsman, D., Lockhart, D. J., and Davis, R. W. (1998). A genome-wide transcriptional analysis of the mitotic cell cycle. Mol. Cell 2, 65-73.
    Cooper, G. F. (1997). A simple constraint-based algorithm for efficiently mining observational databases for causal relationships. Data Mining and Knowledge Discovery 1, 203-224.
    Creighton, C. and Hanash, S. (2003). Mining gene expression databases for association rules. Bioinformatics 19, 79-86.
    
    
    D'haeseleer, P., Liang, S., and Somogyi, R. (2000). Genetic network inference: from co-expression clustering to reverse engineering. Bioinformatics. 16, 707-726.
    D'haeseleer, P., Wen, X., Fuhrman, S., and Somogyi, R. (1999). Linear modeling of mRNA expression levels during CNS development and injury. Pac. Symp. Biocomput. 41-52.
    De Jong, H. (2002). Modeling and simulation of genetic regulatory systems: A literature review. Journal of Computational Biology 9, 67-103.
    DeRisi, J. L., Iyer, V. R., and Brown, P. O. (1997). Exploring the metabolic and genetic control of gene expression on a genomic scale. Science 278, 680-686.
    Dettling, M. and Buhlmann, P. (2003). Boosting for tumor classification with gene expression data. Bioinformatics 19, 1061-1069.
    Deutsch, J. M. (2003). Evolutionary algorithms for finding optimal gene sets in microarray prediction. Bioinformatics 19, 45-52.
    Dudoit, S., Fridlyand, J., and Speed, T. P. (2002). Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association 97, 77-87.
    Eisen, M. B., Spellman, P. T., Brown, P. O., and Botstein, D. (1998). Cluster analysis and display of genome-wide expression patterns. PNAS 95, 14863-14868.
    Ekins, R. P. (1998). Ligand assays: from electrophoresis to miniaturized microarrays. Clin. Chem. 44, 2015-2030.
    Freund, Y. (1995). Boosting a weak algorithm by majority. Information and Computation 121, 256-285.
    Freund, Y. and Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences 55, 119-139.
    Friedman, N. (2004). Inferring cellular networks using probabilistic graphical models. Science 303, 799-805.
    Friedman, N., Linial, M., Nachman, I., and Pe'er, D. (2000). Using Bayesian networks to analyze expression data. J. Comput. Biol. 7, 601-620.
    
    
    Gene Ontology Consortium (2004). The Gene Ontology (GO) database and informatics resource. Nucl. Acids Res. 32, D258-D261.
    John, G. H., Kohavi, R., and Pfleger, K. (1994). Irrelevant features and the subset selection problem. Proceedings of the International Conference on Machine Learning (ICML 1994), 121-129.
    Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., Bloomfield, C. D., and Lander, E. S. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286, 531-537.
    Han, J. W. and Kamber, M. (2000). Data mining concepts and techniques. (Morgan Kaufmann Publishers).
    Hansen, L. K. and Salamon, P. (1990). Neural network ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence 12.
    Hartemink, A. J., Gifford, D. K., Jaakkola, T. S., and Young, R. A. (2001). Using graphical models and genomic expression data to statistically validate models of genetic regulatory networks. Pac. Symp. Biocomput. 422-433.
    Hartemink, A. J., Gifford, D. K., Jaakkola, T. S., and Young, R. A. (2002). Combining location and expression data for principled discovery of genetic regulatory network models. Pac. Symp. Biocomput. 437-449.
    Hastie, T., Tibshirani, R., Eisen, M. B., Alizadeh, A., Levy, R., Staudt, L., Chan, W. C., Botstein, D., and Brown, P. (2000). 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. Genome Biol. 1, RESEARCH0003.
    Heckerman, D. (1998). A tutorial on learning with Bayesian networks. In Learning in Graphical Models (Kluwer Academic Publishers), pp. 301-354.
    Heckerman, D., Geiger, D., and Chickering, D. M. (1995). Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning 20, 197-243.
    Herzel, H., Beule, D., Kielbasa, S., Korbel, J., Sers, C., Malik, A., Eickhoff, H., Lehrach, H., and Schuchhardt, J. (2001). Extracting information from cDNA arrays. Chaos. 11, 98-107.
    Heyer, L. J., Kruglyak, S., and Yooseph, S. (1999). Exploring expression data: identification and analysis of coexpressed genes. Genome Res 9, 1106-1115.
    
    
    Hornik, K. M., Stinchcombe, M., and White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks 2, 359-366.
    Hughes, T. R., Marton, M. J., Jones, A. R., Roberts, C. J., Stoughton, R., Armour, C. D., Bennett, H. A., Coffey, E., Dai, H., He, Y. D., Kidd, M. J., King, A. M., Meyer, M. R., Slade, D., Lum, P. Y., Stepaniants, S. B., Shoemaker, D. D., Gachotte, D., Chakraburtty, K., Simon, J., Bard, M., and Friend, S. H. (2000). Functional discovery via a compendium of expression profiles. Cell 102, 109-126.
    Ideker, T., Thorsson, V., Ranish, J. A., Christmas, R., Buhler, J., Eng, J. K., Bumgarner, R., Goodlett, D. R., Aebersold, R., and Hood, L. (2001). Integrated genomic and proteomic analyses of a systematically perturbed metabolic network. Science 292, 929-934.
    Imoto, S., Goto, T., and Miyano, S. (2002). Estimation of genetic networks and functional structures between genes by using Bayesian networks and nonparametric regression. Pac. Symp. Biocomput. 175-186.
    Jaeger, J., Sengupta, R., and Ruzzo, W. L. (2003). Improved gene selection for classification of microarrays. Pac. Symp. Biocomput. 53-64.
    Jornsten, R. and Yu, B. (2003). Simultaneous gene clustering and subset selection for sample classification via MDL. Bioinformatics 19, 1100-1109.
    Keller, A. D., Schummer, M., Hood, L., and Ruzzo, W. L. (2000). Bayesian Classification of DNA Array Expression Data. Technical Report UW-CSE-2000-08-01.
    Khan, J., Simon, R., Bittner, M., Chen, Y., Leighton, S. B., Pohida, T., Smith, P. D., Jiang, Y., Gooden, G. C., Trent, J. M., and Meltzer, P. S. (1998). Gene expression profiling of alveolar rhabdomyosarcoma with cDNA microarrays. Cancer Res 58, 5009-5013.
    Khan, J., Wei, J. S., Ringner, M., Saal, L. H., Ladanyi, M., Westermann, F., Berthold, F., Schwab, M., Antonescu, C. R., Peterson, C., and Meltzer, P. S. (2001). Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat. Med. 7, 673-679.
    Lee, K. E., Sha, N., Dougherty, E. R., Vannucci, M., and Mallick, B. K. (2003). Gene selection: a Bayesian variable selection approach. Bioinformatics 19, 90-97.
    Liang, S., Fuhrman, S., and Somogyi, R. (1998). Reveal, a general reverse engineering algorithm for inference of genetic network architectures. Pac. Symp. Biocomput. 18-29.
    
    
    Lockhart, D. J., Dong, H., Byrne, M. C., Follettie, M. T., Gallo, M. V., Chee, M. S., Mittmann, M., Wang, C., Kobayashi, M., Horton, H., and Brown, E. L. (1996). Expression monitoring by hybridization to high-density oligonucleotide arrays. Nat. Biotechnol. 14, 1675-1680.
    Lukashin, A. V. and Fuchs, R. (2001). Analysis of temporal gene expression profiles: clustering by simulated annealing and determining the optimal number of clusters. Bioinformatics 17, 405-414.
    Madigan, D., Raftery, A., Volinsky, C., and Hoeting, J. (1996). Bayesian model averaging. Proceedings of the AAAI Workshop on Integrating Multiple Learned Models.
    Michaels, G. S., Carr, D. B., Askenazi, M., Fuhrman, S., Wen, X., and Somogyi, R. (1998). Cluster analysis and data visualization of large-scale gene expression data. Pac. Symp. Biocomput. 42-53.
    Mitra, P., Murthy, C. A., and Pal, S. K. (2002). Unsupervised feature selection using feature similarity. IEEE Transactions on Pattern Analysis and Machine Intelligence 24, 301-312.
    Ooi, C. H. and Tan, P. (2003). Genetic algorithms applied to multi-class prediction for the analysis of gene expression data. Bioinformatics 19, 37-44.
    Pan, W. (2002). A comparative review of statistical methods for discovering differentially expressed genes in replicated microarray experiments. Bioinformatics 18, 546-554.
    Park, P. J., Pagano, M., and Bonetti, M. (2001). A nonparametric scoring algorithm for identifying informative genes from microarray data. Pac. Symp. Biocomput. 52-63.
    Pe'er, D., Regev, A., Elidan, G., and Friedman, N. (2001). Inferring subnetworks from perturbed expression profiles. Bioinformatics 17, S215-S224.
    Pearl, J. (2000). Causality: Models, Reasoning, and Inference. (NY: Cambridge University Press).
    Ramaswamy, S., Tamayo, P., Rifkin, R., Mukherjee, S., Yeang, C. H., Angelo, M., Ladd, C., Reich, M., Latulippe, E., Mesirov, J. P., Poggio, T., Gerald, W., Loda, M., Lander, E. S., and Golub, T. R. (2001). Multiclass cancer diagnosis using tumor gene expression signatures. PNAS 98, 15149-15154.
    Raychaudhuri, S., Stuart, J. M., and Altman, R. B. (2000). Principal components analysis to summarize microarray experiments: application to sporulation time series. Pac. Symp. Biocomput. 455-466.
    Ross, D. T., Scherf, U., Eisen, M. B., Perou, C. M., Rees, C., Spellman, P., Iyer, V., Jeffrey, S. S., Van de Rijn, M., Waltham, M., Pergamenschikov, A., Lee, J. C., Lashkari, D., Shalon, D., Myers, T. G., Weinstein, J. N., Botstein, D., and Brown, P. O. (2000). Systematic variation in gene expression patterns in human cancer cell lines. Nat. Genet. 24, 227-235.
    Sharan, R. and Shamir, R. (2000). CLICK: a clustering algorithm with applications to gene expression analysis. Proc. Int. Conf. Intell. Syst. Mol. Biol. 8, 307-316.
    Silverstein, C., Brin, S., Motwani, R., and Ullman, J. (2000). Scalable techniques for mining causal structures. Data Mining and Knowledge Discovery 4, 163-192.
    Skurichina, M. and Duin, R. P. W. (1998). Bagging for linear classifiers. Pattern Recognition 31, 909-930.
    Spirtes, P. (2001). An anytime algorithm for causal inference. AI and Statistics 2001.
    Tamada, Y., Kim, S., Bannai, H., Imoto, S., Tashiro, K., Kuhara, S., and Miyano, S. (2003). Estimating gene networks from gene expression data by combining Bayesian network model with promoter element detection. Bioinformatics. 19 Suppl 2, Ⅱ227-Ⅱ236.
    Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q., Kitareewan, S., Dmitrovsky, E., Lander, E. S., and Golub, T. R. (1999). Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc. Natl. Acad. Sci U. S. A 96, 2907-2912.
    Tavazoie, S., Hughes, J. D., Campbell, M. J., Cho, R. J., and Church, G. M. (1999). Systematic determination of genetic network architecture. Nat. Genet. 22, 281-285.
    Tusher, V. G., Tibshirani, R., and Chu, G. (2001). Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl. Acad. Sci U. S. A 98, 5116-5121.
    Wahde, M. and Hertz, J. (2000). Coarse-grained reverse engineering of genetic regulatory networks. Biosystems 55, 129-136.
    West, M., Blanchette, C., Dressman, H., Huang, E., Ishida, S., Spang, R., Zuzan, H., Olson, J. A., Jr., Marks, J. R., and Nevins, J. R. (2001). Predicting the clinical status of human breast cancer by using gene expression profiles. Proc. Natl. Acad. Sci U. S. A 98, 11462-11467.
    Xing, E., Jordan, M., and Karp, R. (2001). Feature selection for high-dimensional genomic microarray data. Proceedings of the 18th International Conference on Machine Learning (ICML 2001).
    Xu, Y., Selaru, F. M., Yin, J., Zou, T. T., Shustova, V., Mori, Y., Sato, F., Liu, T. C., Olaru, A., Wang, S., Kimos, M. C., Perry, K., Desai, K., Greenwald, B. D., Krasna, M. J., Shibata, D., Abraham, J. M., and Meltzer, S. J. (2002). Artificial neural networks and gene filtering distinguish between global gene expression profiles of Barrett's esophagus and esophageal cancer. Cancer Res 62, 3493-3497.
    Yeung, K. Y. and Ruzzo, W. L. (2001). Principal component analysis for clustering gene expression data. Bioinformatics 17, 763-774.
    Yoo, C., Thorsson, V., and Cooper, G. F. (2002). Discovery of causal relationships in a gene-regulation pathway from a mixture of experimental and observational DNA microarray data. Pac. Symp. Biocomput. 498-509.
    Zak, D. E., Doyle, F. J., and Schwaber, J. S. (2002). Local identifiability: when can genetic networks be identified from microarray data? Proceedings of the Third International Conference on Systems Biology 236-237.
    Zhou, Z. H. and Chen, S. F. (2002). Neural network ensemble. Chinese Journal of Computers (计算机学报) 25, 1-8.
