Annotation-Based Analysis of Gene Microarray Data
Abstract
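Gene microarray technology allows the expression of thousands of genes to be observed simultaneously. Since its introduction it has been widely adopted in biology and medicine, and many new algorithms have been proposed in computational biology and bioinformatics to analyze the massive data sets that microarray experiments produce. Most of these methods, however, use only one attribute of a gene: its expression level. Other attributes, such as functional similarity, membership in biological pathways, and interactions between gene products, are ignored. To make better use of these attributes in microarray data analysis, this study brings gene annotation information into both algorithm design and evaluation, focusing on the two most important tasks: differential gene selection and gene clustering.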
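     Finding genes whose expression is associated with a particular biological condition is the most common experimental design, yet the task is challenging. Most gene selection algorithms are strained by the high dimensionality and high noise of expression data. We compared five commonly used gene selection algorithms: fold change (FC), the t-test, significance analysis of microarrays (SAM), Baldi's empirical Bayes method (Baldi), and linear models for microarray analysis (Limma). The results confirm that Limma performs well in most situations. We also point out a weakness of Limma: it treats each gene as independently expressed and ignores gene-gene interactions, which often underlie the pathogenesis of polygenic diseases, cancer in particular. To address this we propose a new algorithm, Deam. Deam keeps the form of Limma's prior distribution but improves the estimation of its hyperparameters. It incorporates Gene Ontology annotations and measures the functional similarity between genes by the semantic similarity of their annotations. For each gene in turn, the algorithm finds a group of functionally similar genes and uses the group's expression information to strengthen the estimate of that gene's variance. Three microarray data sets and one simulated data set show that Deam outperforms Limma in most cases. As existing gene similarity measures are improved and new ones are proposed, Deam has further room for improvement. We provide an R implementation of Deam and, for researchers without a statistics or programming background, a Web front end built on the RApache module.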
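     Clustering is another common approach to microarray data analysis. It is more open-ended and uncertain than gene selection. Which distance should be used to compare genes, which clustering algorithm should be applied on top of that distance, and how clustering quality should be judged all remain matters of debate. We first propose PS, an external criterion that uses Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway data to measure clustering performance. After demonstrating the reliability of this criterion, we used it to compare six common clustering algorithms: four hierarchical clustering algorithms, k-medoids, and the self-organizing map. The results show that Ward hierarchical clustering and k-medoids perform well. Separately, principal component analysis is often used in clustering work to reduce the dimensionality of gene expression data. We examined whether principal components capture the structure between clusters better. Again using PS, we compared clustering the original data directly with clustering sets of its principal components, and conclude that clustering on principal components does not necessarily improve performance. We therefore recommend caution in substituting sets of principal components for the original data in microarray clustering.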
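     The main contributions of this study are:
     1) A new gene selection algorithm, Deam, which uses gene annotation information to estimate the prior distribution of expression data. Experiments show that Deam performs better than existing algorithms and has more room for improvement.
     2) A new method for generating simulated gene expression data. Earlier simulations mostly described differential expression with univariate normal distributions of different means. The proposed method draws from multivariate normal distributions, whose covariance matrices capture the correlations between genes.
     3) An efficient method that uses the gene similarity matrix of the whole human genome to build a probe similarity matrix for a specific chip. It reduces the data dimension from n×n to n×d0', where n>>d0'.
     4) A Web application of the Deam algorithm.
     5) A new external criterion, PS, that uses biological pathway resources to measure the performance of clustering algorithms. Performance is reflected in how strongly genes from the same pathway aggregate within a cluster. Experiments confirm the validity of PS.
     6) Evidence, obtained with PS, that clustering on sets of principal components is not necessarily better than clustering on the original data.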
Microarrays enable simultaneous measurement of the expression levels of tens of thousands of genes and have found widespread application in biological and biomedical research. The challenge of interpreting the vast amount of data they produce has led to the development of new methods in computational biology and bioinformatics. However, most algorithms use expression values only. Other attributes of genes, such as functional similarities, pathway information and protein-protein interactions, are ignored. To take advantage of these attributes, we have incorporated annotation resources into microarray analysis. This paper focuses on the two most important issues: gene selection and gene clustering.
     A basic yet challenging task is gene selection: identifying changes in gene expression that are associated with particular biological conditions. Most gene selection algorithms suffer from the high dimensionality of expression data and its inherent noise. Five gene selection algorithms were compared: fold change (FC), the t-test, significance analysis of microarrays (SAM), Baldi's empirical Bayes method (Baldi) and linear models for microarray analysis (Limma). The results showed that Limma is the most powerful in most situations. However, Limma assumes that genes are expressed independently, so the correlation between genes, a very informative resource, is not considered. Keeping the form of the prior distribution in Limma, a new method for estimating the prior was proposed, denoted Deam. It incorporates functional similarities between genes, measured from Gene Ontology annotations. Three publicly available microarray data sets and simulated data were used to evaluate the proposed method. The results show that Deam outperforms Limma in detecting differentially expressed genes in most cases. It also has room to improve further as new algorithms for measuring functional similarity between genes are proposed and existing ones are refined. For the convenience of biological researchers without a programming or statistical background, a web front end to Deam built on the RApache module was provided.
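The idea of strengthening a gene's variance estimate with information from functionally similar genes can be illustrated with a minimal Python sketch. It is not the Deam implementation, whose estimator the abstract does not specify. It assumes the Limma-style shrinkage form s̃² = (d0·s0² + d·s²)/(d0 + d), a fixed prior degrees of freedom `d0` (Limma estimates this from the data), and a per-gene prior variance taken as the similarity-weighted mean variance of the `k` most similar genes. The function names and the shape of the similarity table are hypothetical.

```python
from statistics import variance


def pair_similarity(similarity, a, b):
    """Look up a symmetric similarity score; unknown pairs count as 0."""
    return similarity.get((a, b), similarity.get((b, a), 0.0))


def moderated_variances(expr, similarity, k=2, d0=4.0):
    """Shrink each gene's sample variance toward a neighbour-based prior.

    expr:       dict gene -> list of replicate expression values
    similarity: dict (gene_a, gene_b) -> functional similarity in [0, 1]
    k:          number of most similar genes used for the prior (assumption)
    d0:         prior degrees of freedom, fixed here for simplicity (assumption)
    """
    sample_var = {g: variance(values) for g, values in expr.items()}
    moderated = {}
    for g, values in expr.items():
        d = len(values) - 1  # residual degrees of freedom for gene g
        # Rank the other genes by functional similarity to g, keep the top k.
        ranked = sorted(
            ((pair_similarity(similarity, g, h), h) for h in expr if h != g),
            reverse=True,
        )[:k]
        total_weight = sum(w for w, _ in ranked)
        if total_weight > 0:
            # Similarity-weighted mean of the neighbours' sample variances.
            prior_var = sum(w * sample_var[h] for w, h in ranked) / total_weight
        else:
            # No informative neighbours: fall back to the gene's own variance.
            prior_var = sample_var[g]
        moderated[g] = (d0 * prior_var + d * sample_var[g]) / (d0 + d)
    return moderated
```

With three replicates per gene, `expr = {"A": [1, 2, 3], "B": [2, 4, 6], "C": [5, 5, 5]}`, similarities A-B = 1.0, B-C = 0.5, A-C = 0.0, `k=1` and `d0=2.0`, the sketch gives 2.5 for A and B and 2.0 for C. Gene C, whose own sample variance is zero, no longer produces a degenerate test statistic because it borrows variance from B.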
     Gene clustering is another important approach in microarray analysis. Compared with gene selection, clustering is a more complicated, open problem. Difficulties remain in choosing distance metrics, choosing clustering algorithms and evaluating clustering results. An external criterion, denoted PS in this paper, was proposed that uses Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway annotations to measure the performance of clustering algorithms. After the reliability of this criterion was demonstrated, it was used to compare six commonly used clustering algorithms: four types of hierarchical clustering, k-medoids clustering and the self-organizing map. Hierarchical clustering with Ward's method and k-medoids outperformed the others. In the clustering literature, principal component analysis is sometimes applied to reduce the dimensionality of gene expression data before clustering. Using the proposed criterion, we studied how effectively principal components capture cluster structure. The quality of clusters obtained from the original data was compared with that of clusters obtained after projecting onto subsets of the principal component axes. The results showed that clustering on a set of principal components instead of the original variables does not necessarily improve cluster quality. Overall, we would not recommend using principal components as input for clustering in most situations.
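The abstract does not give the formula for PS, so the following Python sketch is only an illustrative stand-in for a pathway-based external criterion, not the PS definition. It assumes one simple notion of aggregation: for each pathway, the share of its genes that fall in the single cluster holding most of them, averaged over pathways. The name `pathway_aggregation_score` and the input shapes are hypothetical.

```python
from collections import Counter


def pathway_aggregation_score(cluster_of, pathways):
    """Average, over pathways, of the largest share of a pathway's genes
    that land in one cluster. Ranges from just above 0 to 1.

    cluster_of: dict gene -> cluster label
    pathways:   dict pathway name -> iterable of member genes
    """
    shares = []
    for genes in pathways.values():
        # Only genes that were actually clustered can be scored.
        labels = [cluster_of[g] for g in genes if g in cluster_of]
        if len(labels) < 2:
            continue  # a single gene cannot show aggregation
        top_count = Counter(labels).most_common(1)[0][1]
        shares.append(top_count / len(labels))
    return sum(shares) / len(shares) if shares else 0.0
```

A score of this kind is trivially maximized by placing every gene in one cluster, so it is only meaningful when the clusterings being compared have the same number of clusters.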
     The major innovations in this paper are summarized below.
     1. A new gene selection method, Deam, was proposed, which takes advantage of functional similarities between genes. Experimental results showed that Deam performs better than current methods in detecting differentially expressed genes in most cases.
     2. A new method for generating simulated expression data was proposed. Instead of drawing from two univariate normal distributions with different means, as in prior research, it draws from multivariate normal distributions whose covariance matrices reflect the correlations between genes.
     3. Given the functional similarity matrix for the whole human genome, an efficient method was proposed to construct platform-specific similarity matrices. It reduces the dimension of the matrices from n×n to n×d0', where n>>d0'.
     4. An implementation of Deam in R and a web front end were provided.
     5. An external criterion that uses pathway annotations to measure the performance of clustering algorithms was proposed. Performance is reflected in how strongly genes in the same pathway aggregate within a cluster. The validity of this criterion was confirmed experimentally.
     6. The effectiveness of principal components in capturing cluster structure was studied. Using principal components as input for clustering is not recommended.
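The simulation idea in item 2 can be sketched in Python using only the standard library. This is a minimal illustration under stated assumptions, not the generator used in the thesis: samples are drawn as mean + L·z, where L is the Cholesky factor of the covariance matrix and z is a vector of independent standard normals. The function names, the covariance values and the seed are hypothetical.

```python
import math
import random


def cholesky(cov):
    """Lower-triangular L with L * L^T == cov (cov must be positive definite)."""
    n = len(cov)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(cov[i][i] - partial)
            else:
                lower[i][j] = (cov[i][j] - partial) / lower[j][j]
    return lower


def simulate_samples(mean, cov, n_samples, seed=0):
    """Draw n_samples expression vectors from N(mean, cov).

    mean: list of per-gene mean expression values
    cov:  square covariance matrix encoding gene-gene correlation
    """
    rng = random.Random(seed)
    lower = cholesky(cov)
    n = len(mean)
    samples = []
    for _ in range(n_samples):
        z = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # Correlate the independent draws through the Cholesky factor.
        samples.append(
            [mean[i] + sum(lower[i][k] * z[k] for k in range(i + 1)) for i in range(n)]
        )
    return samples
```

To mimic a two-condition experiment, one could call `simulate_samples` twice with the same covariance matrix, once with a control mean vector and once with a mean vector shifted only for the genes meant to be differentially expressed.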
