Combining Multiple Sources of Evidence to Improve Eukaryotic Gene Structure Prediction
Abstract
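The launch of the Human Genome Project signalled that modern biology had entered the era of "omics". To date, genome sequencing has been completed or is under way for nearly 2,000 species. The genome sequence is the genetic and material basis of all of a species' life activities, and the first step in interpreting and understanding a genome sequence is to annotate completely the protein-coding genes it contains.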
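     Many kinds of evidence can support genome annotation, including expressed sequence tags (ESTs), homologous proteins, the output of gene prediction software, and segments conserved between closely related species. These different types of evidence complement one another, yet they can also conflict. Manual genome annotation relies mainly on aligning ESTs to the genome sequence to produce a reliable annotation. Manual annotation, however, is slow and costly, and the quantity and quality of the EST data strongly affect how complete the annotation is. Computational gene prediction provides an inexpensive, complementary initial annotation. It relies mainly on statistical machine learning methods, and although it has made major progress over the past 20 years, some problems remain to be solved. When applied to large-scale genomic sequences, current gene prediction programs still produce too many false positives, and for newly sequenced species that lack training data they give highly inaccurate results.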
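     This thesis presents a score-based method that combines different types of evidence to produce a representative genome annotation. The combined evidence comprises alignments to EST and protein databases and the output of four computational gene predictors (Genscan, Augustus, Fgenesh, and Geneid). First, nonparametric statistical estimation is used to transform the raw score of each piece of evidence so that the transformed score accurately reflects how much that evidence can be trusted. We tested four nonparametric estimators (the empirical distribution, a piecewise linear function, kernel density estimation, and local polynomial estimation), and the results show that local polynomial estimation is the most reliable transformation. Next, all of the evidence is combined and normalized using Dempster-Shafer evidence theory together with a voting scheme. Finally, dynamic programming assembles all of the evidence into a complete eukaryotic gene structure. Because assembling gene structures by dynamic programming does not depend on training data, the method is also suitable for newly sequenced species.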
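     The Dempster-Shafer combination step can be illustrated with a small sketch. The Python below is not SCGPred's code; it applies the standard Dempster rule of combination to two hypothetical evidence sources. The frame {coding, noncoding}, the mass values, and the function name `combine` are assumptions made for illustration, and the voting scheme is not shown.

```python
from itertools import product


def combine(m1, m2):
    """Dempster's rule: fuse two mass functions keyed by frozenset focal elements."""
    fused = {}
    conflict = 0.0
    for (a, mass_a), (b, mass_b) in product(m1.items(), m2.items()):
        common = a & b
        if common:
            fused[common] = fused.get(common, 0.0) + mass_a * mass_b
        else:
            # Mass assigned to contradictory hypotheses is discarded and
            # redistributed by the normalization below.
            conflict += mass_a * mass_b
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict")
    return {focal: mass / (1.0 - conflict) for focal, mass in fused.items()}


CODING = frozenset({"coding"})
NONCODING = frozenset({"noncoding"})
EITHER = CODING | NONCODING  # mass left here expresses ignorance

# Hypothetical transformed scores from two sources for one candidate exon.
source_a = {CODING: 0.6, EITHER: 0.4}
source_b = {CODING: 0.7, NONCODING: 0.1, EITHER: 0.2}

fused = combine(source_a, source_b)
for focal, mass in fused.items():
    print(sorted(focal), round(mass, 4))
```

     Two sources that both lean towards "coding" reinforce each other: the fused belief in "coding" is higher than either input, while the mass left on the whole frame shrinks.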
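     Based on this algorithm we developed a eukaryotic gene structure prediction program named SCGPred (Score-based Combinational Gene Predictor). The software is written in Perl and is open source. This thesis describes the implementation of the combination algorithm in detail and evaluates the software on three large data sets. Two of them (the complete human chromosome 22 and the ENCODE sequence set) are used to evaluate the supervised method, and the complete Ustilago maydis genome is used to evaluate the unsupervised method. The results show that, compared with other gene prediction programs, our method achieves substantially higher sensitivity and precision, especially at the exon level. We also show that, when applied to newly sequenced species, our method outperforms other unsupervised methods.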
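     Exon-level accuracy can be made concrete with a short sketch. The abstract does not give the exact evaluation protocol, so the Python below assumes a convention common in gene-finding benchmarks: a predicted exon counts as correct only if both of its boundaries match an annotated exon exactly. The coordinates and the function name `exon_level_accuracy` are hypothetical.

```python
def exon_level_accuracy(predicted, annotated):
    """Return (sensitivity, precision) for exons given as (start, end) tuples."""
    predicted_set, annotated_set = set(predicted), set(annotated)
    exact = len(predicted_set & annotated_set)  # both boundaries must agree
    sensitivity = exact / len(annotated_set) if annotated_set else 0.0
    precision = exact / len(predicted_set) if predicted_set else 0.0
    return sensitivity, precision


# Hypothetical coordinates: one boundary error and one spurious exon.
annotated = [(100, 200), (300, 400), (500, 650)]
predicted = [(100, 200), (300, 410), (500, 650), (700, 760)]

sensitivity, precision = exon_level_accuracy(predicted, annotated)
print(f"sensitivity={sensitivity:.3f} precision={precision:.3f}")
```

     Here two of the three annotated exons are recovered exactly and two of the four predictions are correct. The gene-finding literature often calls the second ratio "specificity".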
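     Besides protein-coding genes, recent research has identified a class of genes that encode microRNAs. These microRNAs bind by base complementarity to mRNAs (typically those of transcription factor genes) and either block translation of the mRNA or trigger its degradation; they therefore constitute an important post-transcriptional regulatory mechanism. By comparing the Arabidopsis and rice genomes and analysing RNA secondary structure, we predicted 96 Arabidopsis microRNAs and showed that, by binding transcription factor mRNAs, these microRNAs take part in multiple metabolic and genetic pathways.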
The Human Genome Project (HGP) signalled the start of an "omics" era in molecular biology. To date, the genomes of approximately 2,000 organisms have been sequenced or are being sequenced. The first stage in interpreting and annotating genomic data is to list the protein-coding genes and determine the exact exon-intron structure of every gene.
     Many sources can provide evidence for annotating genomes, including expressed sequence tags (ESTs), homologous proteins, computational gene predictions, and sequence conservation among closely related organisms. Evidence from multiple sources is complementary but can also conflict. Although some model species have been annotated by manual curators, manual annotation is time-consuming and costly, and in practice it is limited to the genomes of model species. Computational gene finding has therefore been adopted as the only practical way to produce an initial annotation, especially for newly sequenced species. Computational gene prediction has made good progress in recent years in both methods and prediction accuracy, but the task remains a significant challenge, especially for eukaryotes, in which coding exons are usually separated by introns of varying length. Current gene predictors produce many false positives when applied to large genomic sequences. Moreover, gene finding in newly sequenced genomes is especially difficult because no training set of abundant validated genes is available.
     In this thesis, we present a score-based method for predicting eukaryotic gene structures by combining multiple pieces of evidence generated from a diverse set of sources. The evidence includes the predictions of four leading ab initio gene finders (Genscan, Augustus, Fgenesh, and Geneid) and alignments to EST and protein databases. First, the raw scores of the evidence are transformed by nonparametric estimation into probabilistic scores that reflect the likelihood that the evidence is correct. We tested four methods (empirical distribution, piecewise linear function, kernel density estimation, and local polynomial regression) and found local polynomial regression to be the best method for score transformation. The evidence is then integrated and normalized using the Dempster-Shafer theory of evidence and a voting algorithm. Lastly, the normalized evidence is combined into a frame-consistent gene model by dynamic programming. Because the dynamic programming step requires no training data, the method can be used to predict genes in newly sequenced organisms.
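     The final assembly step can be sketched as a dynamic program over candidate exons. The Python below is an illustrative simplification, not SCGPred's algorithm: it assumes each candidate exon carries a combined score and the codon position of its first base, and it selects the highest-scoring chain of non-overlapping exons that preserves the reading frame. The `Exon` class, the `best_chain` function, and all coordinates and scores are hypothetical. Start and stop codons, splice signals, and strands are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Exon:
    start: int    # 1-based, inclusive
    end: int      # inclusive
    frame: int    # codon position (0, 1 or 2) of the exon's first base
    score: float  # combined evidence score (hypothetical values below)

    @property
    def next_frame(self) -> int:
        """Codon position that the first base of the following exon must have."""
        return (self.frame + self.end - self.start + 1) % 3


def best_chain(exons):
    """Return (chain, total_score) for the best frame-consistent exon chain."""
    order = sorted(exons, key=lambda e: e.end)
    best = [float("-inf")] * len(order)  # best score of a chain ending at exon i
    back = [None] * len(order)           # predecessor index for the traceback
    for i, exon in enumerate(order):
        if exon.frame == 0:              # simplification: frame 0 may open a gene
            best[i] = exon.score
        for j in range(i):
            prev = order[j]
            if (prev.end < exon.start and prev.next_frame == exon.frame
                    and best[j] + exon.score > best[i]):
                best[i] = best[j] + exon.score
                back[i] = j
    if not order or max(best) == float("-inf"):
        return [], 0.0
    i = max(range(len(order)), key=best.__getitem__)
    total, chain = best[i], []
    while i is not None:                 # walk the predecessor links backwards
        chain.append(order[i])
        i = back[i]
    return chain[::-1], total


candidates = [
    Exon(1, 90, 0, 0.9),
    Exon(50, 120, 0, 0.5),     # overlaps the first candidate
    Exon(150, 200, 0, 0.8),
    Exon(150, 210, 1, 0.95),   # high score, but breaks the reading frame
    Exon(300, 359, 0, 0.7),
]
chain, total = best_chain(candidates)
print([(e.start, e.end) for e in chain], round(total, 2))
```

     The candidate with the highest individual score is left out because no chain can reach it in the correct frame, which is the kind of conflict a frame-consistent assembly has to resolve.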
     Based on the models and algorithms described above, we designed a program named SCGPred (Score-based Combinational Gene Predictor). SCGPred is written in Perl and is open source under a GNU license. It can run as a supervised method on previously well-studied genomes and as an unsupervised method on novel genomes. Tested on three datasets of large DNA sequences, two from human (chromosome 22 and the ENCODE sequence set) and one from the novel genome of Ustilago maydis, SCGPred shows a significant improvement over the best of the ab initio gene predictors. We also demonstrate that SCGPred significantly improves prediction in novel genomes by combining several foreign gene finders with similarity alignments, and that it is superior to other unsupervised methods. SCGPred can therefore serve as an alternative gene-finding tool for newly sequenced eukaryotic genomes.
     Besides protein-coding genes, there is a large class of genes that encode microRNAs. MicroRNAs, an abundant class of tiny non-coding RNAs, have emerged in plants and animals as negative regulators that repress translation of, or direct cleavage of, target mRNAs through complementary base pairing. By searching for short complementary sequences between transcription factor open reading frames and intergenic regions, and by considering RNA secondary structure and sequence conservation between the genomes of Arabidopsis and Oryza sativa, we detected 96 candidate Arabidopsis microRNAs. These candidates were predicted to target 102 transcription factor genes classified into 28 transcription factor gene families, particularly DNA-binding transcription factor families, which implies that microRNAs might be involved in complex transcriptional regulatory networks that specify individual cell types in plant development.
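     The complementarity search at the core of this approach can be sketched in a few lines. The Python below is a toy illustration, not the thesis pipeline: it slides a window along an mRNA and reports positions that pair with a candidate microRNA, allowing a small number of mismatches. The sequences, the mismatch threshold, and the function names are assumptions; G:U wobble pairs, secondary structure, and cross-species conservation are not modelled.

```python
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna):
    """Return the reverse complement of an RNA string over A, C, G, U."""
    return rna.translate(COMPLEMENT)[::-1]


def find_target_sites(mirna, mrna, max_mismatches=3):
    """Return (position, mismatches) for each mRNA window that pairs with the miRNA."""
    site = reverse_complement(mirna)  # mRNA sequence that would pair perfectly
    hits = []
    for pos in range(len(mrna) - len(site) + 1):
        window = mrna[pos:pos + len(site)]
        mismatches = sum(1 for a, b in zip(window, site) if a != b)
        if mismatches <= max_mismatches:
            hits.append((pos, mismatches))
    return hits


# Arbitrary 20-nt candidate and a toy mRNA built to contain one perfect site.
mirna = "UGACAGAAGAGAGUGAGCAC"
mrna = "AUGGC" + reverse_complement(mirna) + "UUAA"

hits = find_target_sites(mirna, mrna)
print(hits)
```

     Positions are 0-based offsets into the mRNA. Because the toy mRNA is constructed around a perfect site, the hit list includes offset 5 with zero mismatches.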
引文
1. Allen JE, Pertea M, Salzberg SL. Computational gene prediction using multiple sources of evidence. Genome Res. 2004.14(1):142-148
    2. Allen JE, Salzberg SL. JIGSAW: integration of multiple sources of evidence for gene prediction. Bioinformatics. 2005. 21(18):3596-3603
    3. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997. 25(17):3389-3402
    4. Ambros V. microRNAs: tiny regulators with great potential. Cell. 2001. 107(7):823-826
    5. Bellman R. On the Theory of Dynamic Programming. Proc Natl Acad Sci U SA. 1952. 38(8):716-719
    6. Bellman R. Bottleneck Problems and Dynamic Programming. Proc Natl Acad Sci U S A. 1953. 39(9):947-951
    7. Bellman R, Glicksberg I, Gross O. On Some Variational Problems Occurring in the Theory of Dynamic Programming. Proc Natl Acad Sci U S A. 1953. 39(4):298-301
    8. Bellman R, Kalaba R, Middleton D. Dynamic Programming, Sequential Estimation and Sequential Detection Processes. Proc Natl Acad Sci U S A. 1961. 47(3):338-341
    9. Biemont C, Vieira C. Genetics: junk DNA as an evolutionary force. Nature. 2006. 443(7111):521-524
    10. Burge C, Karlin S. Prediction of complete gene structures in human genomic DNA. J Mol Biol. 1997. 268(1):78-94
    11. Burge CB, Karlin S. Finding the genes in genomic DNA. Curr Opin Struct Biol. 1998. 8(3):346-354
    12. Castillo-Davis CI. The evolution of noncoding DNA: how much junk, how much func? Trends Genet. 2005. 21(10):533-536
    13. Chimpanzee Sequencing and Analysis Consortium. Initial sequence of the chimpanzee genome and comparison with the human genome. Nature. 2005. 437(7055):69-87
    14. Choo KH, Tong JC, Zhang L. Recent applications of Hidden Markov Models in computational biology. Genomics Proteomics Bioinformatics. 2004. 2(2):84-96
    15. Davuluri RV, Grosse I, Zhang MQ. Computational identification of promoters and first exons in the human genome. Nat Genet. 2001. 29(4):412-417
    16. Eddy SR. Noncoding RNA genes. Curr Opin Genet Dev. 1999. 9(6):695-699
    17. Eddy SR. What is dynamic programming? Nat Biotechnol. 2004. 22(7):909-910
    18. Giegerich R. A systematic approach to dynamic programming in bioinformatics. Bioinformatics. 2000.16(8):665-677
    19. Guigo R, Agarwal P, Abril JF, Burset M, Fickett JW. An assessment of gene prediction accuracy in large DNA sequences. Genome Res. 2000. 10(10):1631-1642
    20. Hongay CF, Grisafi PL, Galitski T, Fink GR. Antisense transcription controls cell fate in Saccharomyces cerevisiae. Cell. 2006.127(4):735-745
    21. Karplus K, Sjolander K, Barrett C, Cline M, Haussler D, Hughey R, Holm L, Sander C. Predicting protein structure using hidden Markov models. Proteins. 1997. Suppl 1:134-139
    22. Katayama S, Tomaru Y, Kasukawa T, Waki K, Nakanishi M, Nakamura M, Nishida H, Yap CC, Suzuki M, Kawai J et al. Antisense transcription in the mammalian transcriptome. Science. 2005. 309(5740): 1564-1566
    23. Korf I, Flicek P, Duan D, Brent MR. Integrating genomic homology into gene structure prediction. Bioinformatics. 2001.17 Suppl l:S140-148
    24. Lim LP, Burge CB. A computational analysis of sequence features involved in recognition of short introns. Proc Natl Acad Sci U S A. 2001. 98(20):11193-11198
    25. Mathe C, Sagot MF, Schiex T, Rouze P. Current methods of gene prediction, their strengths and weaknesses. Nucleic Acids Res. 2002. 30(19):4103-4117
    26. Modrek B, Resch A, Grasso C, Lee C. Genome-wide detection of alternative splicing in expressed sequences of human genes. Nucleic Acids Res. 2001. 29(13):2850-2859
    27. Mukherjee S, Mitra S. Hidden Markov Models, grammars, and biology: a tutorial. J Bioinform Comput Biol. 2005. 3(2):491-526
    28. Parra G, Agarwal P, Abril JF, Wiehe T, Fickett JW, Guigo R. Comparative gene prediction in human and mouse. Genome Res. 2003.13(1):108-117
    29. Parra G, Blanco E, Guigo R. GeneID in Drosophila. Genome Res. 2000. 10(4):511-515
    30. Pruitt KD, Tatusova T, Maglott DR. NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 2005. 33(Database issue):D501-504
    31. Rabiner LR. A tutorial on hidden Markov models and selected applications for speech recognition. Proc IEEE. 1989. 77:257-285
    32. Solovyev V, Salamov A. The Gene-Finder computer tools for analysis of human and model organisms genome sequences. Proc Int Conf Intell Syst Mol Biol. 1997. 5:294-302
    33. Stanke M, Waack S. Gene prediction with a hidden Markov model and a new intron submodel. Bioinformatics. 2003. 19 Suppl 2:11215-11225
    34. Taylor WR. Protein structure comparison using iterated double dynamic programming. Protein Sci. 1999. 8(3):654-665
    35. Taylor WR, Saelensminde G, Eidhammer I. Multiple protein sequence alignment using double-dynamic programming. Comput Chem. 2000. 24(1):3-12
    36. The Genome International Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature. 2001. 409(6822):860-921
    37. Waterman MS. Sequence alignments in the neighborhood of the optimum with general application to dynamic programming. Proc Natl Acad Sci U S A. 1983. 80(10):3123-3124
    38. Wong GK, Passey DA, Huang Y, Yang Z, Yu J. Is "junk" DNA mostly intron DNA? Genome Res. 2000.10(11): 1672-1678
    39. Zhang MQ. Computational prediction of eukaryotic protein-coding genes. Nat Rev Genet. 2002. 3(9):698-709
    1. Alim S. Application of Dempster-Shafer Theory for Interpretation of Seismic Parameters. J Struct Eng-ASCE. 1988.114(9):2070-2084
    2. Allen JE, Pertea M, Salzberg SL. Computational gene prediction using multiple sources of evidence. Genome Res. 2004.14(1):142-148
    3. Barnett JA. Calculating Dempster-Shafer Plausibility. IEEE Trans Pattern Anal Mach Intell. 1991.13(6):599-602
    4. Burge C, Karlin S. Prediction of complete gene structures in human genomic DNA. J Mol Biol. 1997. 268(1):78-94
    5. Burset M, Guigo R. Evaluation of gene structure prediction programs. Genomics. 1996. 34(3):353-367
    6. Caselton WF, Luo WB. Decision-Making with Imprecise Probabilities -Dempster-Shafer Theory and Application. Water Resour Res. 1992. 28(12):3071-3083
    7. Cleveland WS. Robust Locally Weighted Regression and Smoothing Scatterplots. J Am Stat Assoc. 1979. 74(368):829-836
    8. Dempster AP. Upper and Lower Probability Inferences Based on a Sample from a Finite Univariate Population. Biometrika. 1967. 54:515-&
    9. Dempster AP. Upper and Lower Probabilities Generated by a Random Closed Interval. Annals of Mathematical Statistics. 1968. 39(3):957-&
    10. Fan JG, I. Local Polynomial Modelling and Its Applications London: Chapman & Hall. 1996.
    11. Fan JQ. Local Linear-Regression Smoothers and Their Minimax Efficiencies. Ann Stat. 1993. 21(1):196-216
    12. Frey PW. House Architecture Judgments - Bayesian, Dempster-Shafer, or Rule-Based Reasoning. Bulletin of the Psychonomic Society. 1986. 24(5):351-351
    13. Gabbay DM, Smets P: The transferable belief model for quantified belief representation. In: Handbook of Defeasible Reasoning and Uncertainty Management Systems vol 1. The Netherlands: Kluwer; 1998: 267-301.
    14. Hajek P, Harmanec D. On Belief Functions - (Present State of Dempster-Shafer Theory). Lect Notes Artif Int. 1992. 617:286-307
    15. Kennes R, Smets P. Fast Algorithms for Dempster-Shafer Theory. Lecture Notes in Computer Science. 1991. 521:14-23
    16. Krogh A. Two methods for improving performance of an HMM and their application for gene finding. Proc Int Conf Intell Syst Mol Biol. 1997. 5:179-186
    17. Murakami K, Takagi T. Gene recognition by combination of several gene-finding programs. Bioinformatics. 1998.14(8):665-675
    18. Prakasa Rao BLS. Nonparametric Functional Estimation Academic Press. 1983.
    19. Rogic S, Mackworth AK, Ouellette FB. Evaluation of gene-finding programs on mammalian sequences. Genome Res. 2001. 11(5):817-832
    20. Rogic S, Ouellette BF, Mackworth AK. Improving gene recognition accuracy by combining predictions from two gene-finding programs. Bioinformatics. 2002.18(8): 1034-1045
    21. Shafer G A Mathematical Theory of Evidence. Princeton: Princeton University Press. 1976.
    22. Smets P, Kennes R. The Transferable Belief Model. Artif Intell. 1994. 66(2):191-234
    23. Solovyev V, Salamov A. The Gene-Finder computer tools for analysis of human and model organisms genome sequences. Proc Int Conf Intell Syst Mol Biol. 1997. 5:294-302
    24. Stone CI. Consistent Nonparametric Regression. Ann Stat. 1977. 5(4):595-645
    25. Uberbacher EC, Mural RJ. Locating protein-coding regions in human DNA sequences by a multiple sensor-neural network approach. Proc Natl Acad Sci U S A. 1991. 88(24):11261-11265
    26. Yager RR. Arithmetic and Other Operations on Dempster-Shafer Structures. International Journal of Man-Machine Studies. 1986.25(4):357-366
    27. Yager RR. On the Dempster-Shafer Framework and New Combination Rules. INFORM SCIENCES. 1987.41(2):93-137
    28. Yager RR. Decision-Making under Dempster-Shafer Uncertainties. INT J GEN SYST. 1992. 20(3):233-245
    29. Zhang MQ. Identification of protein coding regions in the human genome by quadratic discriminant analysis. Proc Natl Acad Sci U S A. 1997. 94(2):565-568
    1. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997. 25(17):3389-3402
    2. Burge C, Karlin S. Prediction of complete gene structures in human genomic DNA. J Mol Biol. 1997. 268(1):78-94
    3. Burset M, Guigo R. Evaluation of gene structure prediction programs. Genomics. 1996.34(3):353-367
    4. Davuluri RV, Grosse I, Zhang MQ. Computational identification of promoters and first exons in the human genome. Nat Genet. 2001. 29(4):412-417
    5. Gillespie T. Perl 5 how to: The definitive Perl programming problem solver -GIover,M. Library Journal. 1996.121(14):204-204
    6. Gillespie T. Perl 5 interactive course: Master Perl & earn a certificate of achievement - Orwant,J. Library Journal. 1996.121(18):104-104
    7. Gillespie T. Programming Perl - Wall,L. Library Journal. 1996. 121(20):138-138
    8. Gillespie T. Effective perl programming. Library Journal. 1998. 123(4): 121-121
    9. Gillespie T. Perl cookbook. Library Journal. 1998.123(20):147-147
    10. Gillespie T. Mastering algorithms with Perl. Library Journal. 1999. 124(18): 120-120
    11. Gillespie T. Object oriented Perl. Library Journal. 2000.125(2):112-112
    12. Gordon RS. SAMS teach yourself perl in 21 days. Library Journal. 2002. 127(18):125-125
    13. Gordon RS. SAMS teach yourself Perl in 24 hours. Library Journal. 2005. 130(18):108-108
    14. Guigo R, Agarwal P, Abril JF, Burset M, Fickett JW. An assessment of gene prediction accuracy in large DNA sequences. Genome Res. 2000. 10(10):1631-1642
    15. Guigo R, Flicek P, Abril JF, Reymond A, Lagarde J, Denoeud F, Antonarakis S, Ashburner M, Bajic VB, Birney E et al. EGASP: the human ENCODE Genome Annotation Assessment Project. Genome Biol. 2006. 7 Suppl 1:S2 1-31
    16. Hinrichs AS, Karolchik D, Baertsch R, Barber GP, Bejerano G, Clawson H, Diekhans M, Furey TS, Harte RA, Hsu F et al. The UCSC Genome Browser Database: update 2006. Nucleic Acids Res. 2006. 34(Database issue):D590-598
    17. Horton RM, Tait RC. An introduction to Perl: the QuizMaster revisited. Biotechniques. 1999. 27(3):470-+
    18. Kamper J, Kahmann R, Bolker M, Ma LJ, Brefort T, Saville BJ, Banuett F, Kronstad JW, Gold SE, Muller O et al. Insights from the genome of the biotrophic fungal plant pathogen Ustilago maydis. Nature. 2006. 444(7115):97-101
    19. Keibler E, Brent MR. Eval: a software package for analysis of genome annotations. BMC Bioinformatics. 2003.4:50
    20. Kiesling R. Exploring Perl libraries. Dr Dobbs Journal. 2001. 26(2):84-+
    21. Korf I. Gene finding in novel genomes. BMC Bioinformatics. 2004. 5:59
    22. Lee Y, Tsai J, Sunkara S, Karamycheva S, Pertea G, Sultana R, Antonescu V, Chan A, Cheung F, Quackenbush J. The TIGR Gene Indices: clustering and assembling EST and known genes and integration with eukaryotic genomes. Nucleic Acids Res. 2005. 33(Database issue):D71-74
    23. Lomsadze A, Ter-Hovhannisyan V, Chernoff YO, Borodovsky M. Gene identification in novel eukaryotic genomes by self-training algorithm. Nucleic Acids Research. 2005. 33(20):6494-6506
    24. Munch K, Krogh A. Automatic generation of gene finders for eukaryotic species. BMC Bioinformatics. 2006. 7:263
    25. Paris D. Perl versus the World. Dr Dobbs Journal. 2001. 26(5):12-12
    26. Parra G, Blanco E, Guigo R. GeneID in Drosophila. Genome Res. 2000. 10(4):511-515
    27. Pocock MR, Hubbard T, Birney E. SPEM: a parser for EMBL style flat file database entries. Bioinformatics. 1998.14(9):823-824
    28. Pruitt KD, Tatusova T, Maglott DR. NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 2005. 33(Database issue):D501-504
    29. Rogic S, Mackworth AK, Ouellette FB. Evaluation of gene-finding programs on mammalian sequences. Genome Res. 2001. 11(5):817-832
    30. Smith B. A Programming Perl. Byte. 1993.18(2):235-235
    31. Solovyev W, Salamov AA, Lawrence CB. Identification of human gene structure using linear discriminant functions and dynamic programming. Proc Int Conf Intell Syst Mol Biol. 1995. 3:367-375
    32. Stajich JE, Block D, Boulez K, Brenner SE, Chervitz SA, Dagdigian C, Fuellen G, Gilbert JGR, Korf I, Lapp H et al. The bioperl toolkit: Perl modules for the life sciences. Genome Research. 2002.12(10):1611-1618
    33. Stanke M, Waack S. Gene prediction with a hidden Markov model and a new intron submodel. Bioinformatics. 2003.19 Suppl 2:11215-11225
    34. The ENCODE Project Consortium. The ENCODE (ENCyclopedia Of DNA Elements) Project. Science. 2004. 306(5696):636-640
    35. Thornton-Wells TA, Johnson KB. Perl programming for biologists. Journal of the American Medical Informatics Association. 2004.11(3): 173-173
    36. Wu CH, Apweiler R, Bairoch A, Natale DA, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R et al. The Universal Protein Resource (UniProt): an expanding universe of protein information. Nucleic Acids Res. 2006. 34(Database issue):D187-191
    1. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997. 25(17):3389-3402
    2. Bartel B, Bartel DP. MicroRNAs: at the root of plant development? Plant Physiol. 2003.132(2):709-717
    3. Bartel DP. MicroRNAs: genomics, biogenesis, mechanism, and function. Cell. 2004.116(2):281-297
    4. Bass BL. Double-stranded RNA as a template for gene silencing. Cell. 2000. 101(3):235-238
    5. Carrington JC, Ambros V. Role of microRNAs in plant and animal development. Science. 2003. 301(5631):336-338
    6. Chen X. A microRNA as a translational repressor of. APETALA2 in Arabidopsis flower development. Science. 2004. 303(5666):2022-2025
    7. Chen X. MicroRNA biogenesis and function in plants. FEBS Lett. 2005. 579(26):5923-5931
    8. Combier JP, Frugier F, de Billy F, Boualem A, El-Yahyaoui F, Moreau S, Vernie T, Ott T, Gamas P, Crespi M et al. MtHAP2-l is a key transcriptional regulator of symbiotic nodule development regulated by microRNA169 in Medicago truncatula. Genes Dev. 2006. 20(22):3084-3088
    9. Davidson EH. Genomic Regulatory Systems: Academic Press. 2001.
    10. De Rijk P, Wuyts J, De Wachter R. RnaViz 2: an improved representation of RNA secondary structure. Bioinformatics. 2003.19(2):299-300
    11. Emery JF, Floyd SK, Alvarez J, Eshed Y, Hawker NP, Izhaki A, Baum SF, Bowman JL. Radial patterning of Arabidopsis shoots by class III HD-ZIP and KANADI genes. Curr Biol. 2003. 13(20):1768-1774
    12. Grad Y, Aach J, Hayes GD, Reinhart BJ, Church GM, Ruvkun G, Kim J. Computational and experimental identification of C. elegans microRNAs. Mol Cell. 2003.11(5): 1253-1263
    13. Griffiths-Jones S. The microRNA Registry. Nucleic Acids Res. 2004. 32(Database issue):D109-111
    14. Heim MA, Jakoby M, Werber M, Martin C, Weisshaar B, Bailey PC. The basic helix-loop-helix transcription factor family in plants: a genome-wide study of protein structure and functional diversity. Mol Biol Evol. 2003. 20(5):735-747
    15. Hofacker IL, Fontana W, Stadler PF, Bonhoeffer LS, Tacker M, Schuster P. Fast Folding and Comparison of Rna Secondary Structures. Monatshefte f Chemie. 1994.125(2):167-188
    16. Jones-Rhoades MW, Bartel DP. Computational identification of plant microRNAs and their targets, including a stress-induced miRNA. Mol Cell. 2004.14(6):787-799
    17. Kasschau KD, Xie Z, Allen E, Llave C, Chapman EJ, Krizan KA, Carrington JC. P1/HC-Pro, a viral suppressor of RNA silencing, interferes with Arabidopsis development and miRNA unction. Dev Cell. 2003.4(2):205-217
    18. Kidner CA, Martienssen RA. The developmental role of microRNA in plants. Curr Opin Plant Biol. 2005. 8(1):38-44
    19. Lai EC, Tomancak P, Williams RW, Rubin GM. Computational identification of Drosophila microRNA genes. Genome Biol. 2003. 4(7):R42
    20. Lee Y, Ahn C, Han J, Choi H, Kim J, Yim J, Lee J, Provost P, Radmark O, Kim S et al. The nuclear RNase III Drosha initiates microRNA processing. Nature. 2003.425(6956):415-419
    21. Lee Y, Kim M, Han J, Yeom KH, Lee S, Baek SH, Kim VN. MicroRNA genes are transcribed by RNA polymerase II. Embo J. 2004. 23(20):4051-4060
    22. Li X, Duan X, Jiang H, Sun Y, Tang Y, Yuan Z, Guo J, Liang W, Chen L, Yin J et al. Genome-wide analysis of basic/helix-loop-helix transcription factor family in rice and Arabidopsis. Plant Physiol. 2006.141(4):1167-1184
    23. Llave C, Xie Z, Kasschau KD, Carrington JC. Cleavage of Scarecrow-like mRNA targets directed by a class of Arabidopsis miRNA. Science. 2002. 297(5589):2053-2056
    24. Mallory AC, Dugas DV, Bartel DP, Bartel B. MicroRNA regulation- of NAC-domain targets is required for proper formation and separation of adjacent embryonic, vegetative, and floral organs. Curr Biol. 2004. 14(12):1035-1046
    25. McManus MT, Sharp PA. Gene silencing in mammals by small interfering RNAs. Nat Rev Genet. 2002. 3(10):737-747
    26. Millar AA, Waterhouse PM. Plant and animal microRNAs: similarities and differences. Funct Integr Genomics. 2005. 5(3):129-135
    27. Ng Kwang Loong S, Mishra SK. Unique folding of precursor microRNAs: quantitative evidence and implications for de novo identification. Rna. 2007. 13(2): 170-187
    28. Palatnik JF, Allen E, Wu X, Schommer C, Schwab R, Carrington JC, Weigel D. Control of leaf morphogenesis by microRNAs. Nature. 2003. 425(6955):257-263
    29. Park W, Li J, Song R, Messing J, Chen X. CARPEL FACTORY, a Dicer homolog, and HEN1, a novel protein, act in microRNA metabolism in Arabidopsis thaliana. Curr Biol. 2002.12(17):1484-1495
    30. Ptashne M. A Genetic Switch: Blackwell Publishing. 1992.
    31. Reinhart BJ, Weinstein EG, Rhoades MW, Bartel B, Bartel DP. MicroRNAs in plants. Genes Dev. 2002.16(13): 1616-1626
    32. Rhoades MW, Reinhart BJ, Lim LP, Burge CB, Bartel B, Bartel DP. Prediction of plant microRNA targets. Cell. 2002.110(4):513-520
    33. Schauer SE, Jacobsen SE, Meinke DW, Ray A. DICER-LIKE1: blind men and elephants in Arabidopsis development. Trends Plant Sci. 2002. 7(11):487-491
    34. Shannon CE. A mathematical theory of communication. Bell Syst Tech. 1948. 27:379-423
    35. Sunkar R, Zhu JK. Novel and stress-regulated microRNAs and other small RNAs from Arabidopsis. Plant Cell. 2004.16(8):2001-2019
    36. Tang G, Reinhart BJ, Bartel DP, Zamore PD. A biochemical framework for RNA silencing in plants. Genes Dev. 2003.17(1):49-63
    37. Wang XJ, Reyes JL, Chua NH, Gaasterland T. Prediction and identification of Arabidopsis thaliana microRNAs and their mRNA targets. Genome Biol. 2004. 5(9):R65
    38. Zhang B, Pan X, Wang Q, Cobb GP, Anderson TA. Computational identification of microRNAs and their targets. Comput Biol Chem. 2006. 30(6):395-407
    39. Zhang B, Wang Q, Pan X. MicroRNAs and their regulatory roles in animals and plants. J Cell Physiol. 2007. 210(2):279-289
    1. Alexandersson M, Cawley S, Pachter L. SLAM: cross-species gene finding and alignment with a generalized pair hidden Markov model. Genome Res. 2003.13(3):496-502
    2. Allen JE, Pertea M, Salzberg SL. Computational gene prediction using multiple sources of evidence. Genome Res. 2004.14(1):142-148
    3. Allen JE, Salzberg SL. JIGSAW: mtegration of multiple sources of evidence for gene prediction. Bioinformatics. 2005. 21(18):3596-3603
    4. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997. 25(17):3389-3402
    5. Ansel KM, Lee DU, Rao A. An epigenetic view of helper T cell differentiation. Nat Immunol. 2003.4(7):616-623
    6. Arabidopsis Genome Initiative. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature. 2000. 408(6814):796-815
    7. Bafna V, Huson DH. The conserved exon method for gene finding. Proc Int Conf Intell Syst Mol Biol. 2000. 8:3-12
    8. Batzoglou S, Pachter L, Mesirov JP, Berger B, Lander ES. Human and mouse gene structure: comparative analysis and application to exon prediction. Genome Res. 2000.10(7):950-958
    9. Birney E, Clamp M, Durbin R. GeneWise and Genomewise. Genome Res. 2004.14(5):988-995
    10. Birney E, Durbin R. Dynamite: a flexible code generating language for dynamic programming methods used in sequence comparison. Proc Int Conf Intell Syst Mol Biol. 1997. 5:56-64
    11. Blayo P, Rouze P, Sagot ME Orphan gene finding - an exon assembly approach. Theor Comput Sci. 2003. 290(3):1407-1431
    12. Borodovsky M, McIninch J. Genmark - Parallel Gene Recognition for Both DNA Strands. Comput Chem. 1993. 17(2):123-133
    13. Borodovsky M, Rudd KE, Koonin EV. Intrinsic and extrinsic approaches for detecting genes in a bacterial genome. Nucleic Acids Res. 1994.-22(22):4756-4767
    14. Bray N, Dubchak I, Pachter L. AVID: A global alignment program. Genome Res. 2003.13(1):97-102
    15. Brendel V, Kleffe J, Carle-Urioste JC, Walbot V. Prediction of splice sites in plant pre-mRNA from sequence properties. J Mol Biol. 1998. 276(1):85-104
    16. Burge C, Karlin S. Prediction of complete gene structures in human genomic DNA. J Mol Biol. 1997. 268(1):78-94
    17. Burge CB, Karlin S. Finding the genes in genomic DNA. Curr Opin Struct Biol. 1998. 8(3):346-354
    18. Burset M, Guigo R. Evaluation of gene structure prediction programs. Genomics. 1996. 34(3):353-367
    19. Choo KH, Tong JC, Zhang L. Recent applications of Hidden Markov Models in computational biology. Genomics Proteomics Bioinformatics. 2004. 2(2):84-96
    20. Chuang JS, Roth D. Gene recognition based on DAG shortest paths. Bioinformatics. 2001.17 Suppl l:S56-64
    21. Delcher AL, Harmon D, Kasif S, White O, Salzberg SL. Improved microbial gene identification with GLIMMER. Nucleic Acids Res. 1999. 27(23):4636-4641
    22. Dong S, Searls DB. Gene structure prediction by linguistic methods. Genomics. 1994. 23(3):540-551
    23. Eddy SR. Multiple alignment using hidden Markov models. Proc Int Conf Intell Syst Mol Biol. 1995. 3:114-120
    24. Eddy SR. Hidden Markov models. Curr Opin Struct Biol. 1996. 6(3):361-365
    25. Eddy SR. Profile hidden Markov models. Bioinformatics. 1998. 14(9):755-763
    26. Fickett JW. Recognition of protein coding regions in DNA sequences. Nucleic Acids Res. 1982.10(17):5303-5318
    27. Fields CA, Soderlund CA. gm: a practical tool for automating DNA sequence analysis. Comput Appl Biosci. 1990. 6(3):263-270
    28. Florea L, Hartzell G, Zhang Z, Rubin GM, Miller W. A computer program for aligning a cDNA sequence with a genomic DNA sequence. Genome Res. 1998. 8(9):967-974
    29. Foissac S, Bardou P, Moisan A, Cros MJ, Schiex T. EUGENE'HOM: A generic similarity-based gene finder using multiple homologous sequences. Nucleic Acids Res. 2003. 31(13):3742-3745
    30. Gelfand MS, Mironov AA, Pevzner PA. Gene recognition via spliced sequence alignment. Proc Natl Acad Sci U S A. 1996. 93(17):9061-9066
    31. Gelfand MS, Roytberg MA. Prediction of the exon-intron structure by a dynamic programming approach. Biosystems. 1993. 30(1-3):173-182
    32. Goff SA, Ricke D, Lan TH, Presting G, Wang R, Dunn M, Glazebrook J, Sessions A, Oeller P, Varma H et al. A draft sequence of the rice genome (Oryza sativa L. ssp. japonica). Science. 2002. 296(5565):92-100
    33. Gotoh O. Homology-based gene structure prediction: simplified matching algorithm using a translated codon (tron) and improved accuracy by allowing for long gaps. Bioinformatics. 2000.16(3): 190-202
    34. Guigo R. Assembling genes from predicted exons in linear time with dynamic programming. J Comput Biol. 1998. 5(4):681-702
    35. Guigo R, Knudsen S, Drake N, Smith T. Prediction of gene structure. J Mol Biol. 1992. 226(1):141-157
    36. Hebsgaard SM, Korning PG, Tolstrup N, Engelbrecht J, Rouze P, Brunak S. Splice site prediction in Arabidopsis thaliana pre-mRNA by combining local and global sequence information. Nucleic Acids Res. 1996. 24(17):3439-3452
    37. Henderson J, Salzberg S, Fasman KH. Finding genes in DNA with a Hidden Markov Model. J Comput Biol. 1997. 4(2):127-141
    38. Hinrichs AS, Karolchik D, Baertsch R, Barber GP, Bejerano G, Clawson H, Diekhans M, Furey TS, Harte RA, Hsu F et al. The UCSC Genome Browser Database: update 2006. Nucleic Acids Res. 2006. 34(Database issue):D590-598
    39. Hooper PM, Zhang H, Wishart DS. Prediction of genetic structure in eukaryotic DNA using reference point logistic regression and sequence alignment. Bioinformatics. 2000.16(5):425-438
    40. Huang X, Adams MD, Zhou H, Kerlavage AR. A tool for analyzing and annotating genomic sequences. Genomics. 1997. 46(1):37-45
    41. Hubbard T, Barker D, Birney E, Cameron G, Chen Y, Clark L, Cox T, Cuff J, Curwen V, Down T et al. The Ensembl genome database project. Nucleic Acids Res. 2002. 30(1):38-41
    42. Hutchinson GB, Hayden MR. The prediction of exons through an analysis of spliceable open reading frames. Nucleic Acids Res. 1992. 20(13):3453-3462
    43. Jiang J, Jacob HJ. EbEST: an automated tool using expressed sequence tags to delineate gene structure. Genome Res. 1998. 8(3):268-275
    44. Kan Z, Rouchka EC, Gish WR, States DJ. Gene structure prediction and alternative splicing analysis using genomically aligned ESTs. Genome Res. 2001. 11(5):889-900
    45. Karplus K, Barrett C, Hughey R. Hidden Markov models for detecting remote protein homologies. Bioinformatics. 1998. 14(10):846-856
    46. Kleffe J, Hermann K, Vahrson W, Wittig B, Brendel V. Logitlinear models for the prediction of splice sites in plant pre-mRNA sequences. Nucleic Acids Res. 1996. 24(23):4709-4718
    47. Kleffe J, Hermann K, Vahrson W, Wittig B, Brendel V. GeneGenerator--a flexible algorithm for gene prediction and its application to maize sequences. Bioinformatics. 1998. 14(3):232-243
    48. Korf I, Flicek P, Duan D, Brent MR. Integrating genomic homology into gene structure prediction. Bioinformatics. 2001. 17 Suppl 1:S140-148
    49. Kotlar D, Lavner Y. Gene prediction by spectral rotation measure: a new method for identifying protein-coding regions. Genome Res. 2003. 13(8):1930-1937
    50. Krogh A. Two methods for improving performance of an HMM and their application for gene finding. Proc Int Conf Intell Syst Mol Biol. 1997. 5:179-186
    51. Krogh A. Using database matches with HMMGene for automated gene detection in Drosophila. Genome Res. 2000. 10(4):523-528
    52. Krogh A, Mian IS, Haussler D. A hidden Markov model that finds genes in E. coli DNA. Nucleic Acids Res. 1994. 22(22):4768-4778
    53. Kulp D, Haussler D, Reese MG, Eeckman FH. A generalized hidden Markov model for the recognition of human genes in DNA. Proc Int Conf Intell Syst Mol Biol. 1996. 4:134-142
    54. Laub MT, Smith DW. Finding intron/exon splice junctions using INFO, INterruption Finder and Organizer. J Comput Biol. 1998. 5(2):307-321
    55. Lee Y, Tsai J, Sunkara S, Karamycheva S, Pertea G, Sultana R, Antonescu V, Chan A, Cheung F, Quackenbush J. The TIGR Gene Indices: clustering and assembling EST and known genes and integration with eukaryotic genomes. Nucleic Acids Res. 2005. 33(Database issue):D71-74
    56. Lim LP, Burge CB. A computational analysis of sequence features involved in recognition of short introns. Proc Natl Acad Sci U S A. 2001. 98(20):11193-11198
    57. Lukashin AV, Borodovsky M. GeneMark.hmm: new solutions for gene finding. Nucleic Acids Res. 1998. 26(4):1107-1115
    58. Mayor C, Brudno M, Schwartz JR, Poliakov A, Rubin EM, Frazer KA, Pachter LS, Dubchak I. VISTA: visualizing global DNA sequence alignments of arbitrary length. Bioinformatics. 2000. 16(11):1046-1047
    59. Miller G, Fuchs R, Lai E. IMAGE cDNA clones, UniGene clustering, and ACeDB: an integrated resource for expressed sequence information. Genome Res. 1997. 7(10):1027-1032
    60. Mott R. ESTGENOME: a program to align spliced DNA sequences to unspliced genomic DNA. Comput Appl Biosci. 1997. 13(4):477-478
    61. Mukherjee S, Mitra S. Hidden Markov Models, grammars, and biology: a tutorial. J Bioinform Comput Biol. 2005. 3(2):491-526
    62. Murakami K, Takagi T. Gene recognition by combination of several gene-finding programs. Bioinformatics. 1998. 14(8):665-675
    63. Myers EW, Sutton GG, Delcher AL, Dew IM, Fasulo DP, Flanigan MJ, Kravitz SA, Mobarry CM, Reinert KH, Remington KA et al. A whole-genome assembly of Drosophila. Science. 2000. 287(5461):2196-2204
    64. Nardone J, Lee DU, Ansel KM, Rao A. Bioinformatics for the 'bench biologist': how to find regulatory regions in genomic DNA. Nat Immunol. 2004. 5(8):768-774
    65. Novichkov PS, Gelfand MS, Mironov AA. Gene recognition in eukaryotic DNA by comparison of genomic sequences. Bioinformatics. 2001. 17(11):1011-1018
    66. Pachter L, Batzoglou S, Spitkovsky VI, Banks E, Lander ES, Kleitman DJ, Berger B. A dictionary-based approach for gene annotation. J Comput Biol. 1999. 6(3-4):419-430
    67. Pearson WR, Lipman DJ. Improved tools for biological sequence comparison. Proc Natl Acad Sci U S A. 1988. 85(8):2444-2448
    68. Pedersen AG, Baldi P, Brunak S, Chauvin Y. Characterization of prokaryotic and eukaryotic promoters using hidden Markov models. Proc Int Conf Intell Syst Mol Biol. 1996. 4:182-191
    69. Pertea M, Lin X, Salzberg SL. GeneSplicer: a new computational method for splice site prediction. Nucleic Acids Res. 2001. 29(5):1185-1190
    70. Rabiner LR. A tutorial on hidden Markov models and selected applications for speech recognition. Proc IEEE. 1989. 77:257-285
    71. Reese MG, Eeckman FH, Kulp D, Haussler D. Improved splice site detection in Genie. J Comput Biol. 1997. 4(3):311-323
    72. Rivas E, Eddy SR. A dynamic programming algorithm for RNA structure prediction including pseudoknots. J Mol Biol. 1999. 285(5):2053-2068
    73. Rogic S, Ouellette BF, Mackworth AK. Improving gene recognition accuracy by combining predictions from two gene-finding programs. Bioinformatics. 2002. 18(8):1034-1045
    74. Rogozin I, Milanesi L. Analysis of donor splice signals in different organisms. J Mol Evol. 1997. 45:50-59
    75. Rogozin IB, D'Angelo D, Milanesi L. Protein-coding regions prediction combining similarity searches and conservative evolutionary properties of protein-coding sequences. Gene. 1999. 226(1):129-137
    76. Rogozin IB, Milanesi L, Kolchanov NA. Gene structure prediction using information on homologous protein sequence. Comput Appl Biosci. 1996. 12(3):161-170
    77. Rose RC, Juang BH. Hidden Markov models for speech and signal recognition. Electroencephalogr Clin Neurophysiol Suppl. 1996. 45:137-152
    78. Salzberg S, Chen X, Henderson J, Fasman K. Finding genes in DNA using decision trees and dynamic programming. Proc Int Conf Intell Syst Mol Biol. 1996. 4:201-210
    79. Salzberg SL. A method for identifying splice sites and translational start sites in eukaryotic mRNA. Comput Appl Biosci. 1997. 13(4):365-376
    80. Salzberg SL, Pertea M, Delcher AL, Gardner MJ, Tettelin H. Interpolated Markov models for eukaryotic gene finding. Genomics. 1999. 59(1):24-31
    81. Schiex T, Moisan A, Rouze P. EuGene: an eukaryotic gene finder that combines several sources of evidence. Lecture Notes in Computer Science. 2001:111-125
    82. Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W. Human-mouse alignments with BLASTZ. Genome Res. 2003. 13(1):103-107
    83. Schwartz S, Zhang Z, Frazer KA, Smit A, Riemer C, Bouck J, Gibbs R, Hardison R, Miller W. PipMaker--a web server for aligning two genomic DNA sequences. Genome Res. 2000. 10(4):577-586
    84. Senawongse P, Dalby AR, Yang ZR. Predicting the phosphorylation sites using hidden Markov models and machine learning methods. J Chem Inf Model. 2005. 45(4):1147-1152
    85. Sharp PA, Burge CB. Classification of introns: U2-type or U12-type. Cell. 1997. 91(7):875-879
    86. Snyder EE, Stormo GD. Identification of protein coding regions in genomic DNA. J Mol Biol. 1995. 248(1):1-18
    87. Solovyev V, Salamov A. The Gene-Finder computer tools for analysis of human and model organisms genome sequences. Proc Int Conf Intell Syst Mol Biol. 1997. 5:294-302
    88. Solovyev VV, Salamov AA, Lawrence CB. Identification of human gene structure using linear discriminant functions and dynamic programming. Proc Int Conf Intell Syst Mol Biol. 1995. 3:367-375
    89. Stanke M, Schoffmann O, Morgenstern B, Waack S. Gene prediction in eukaryotes with a generalized hidden Markov model that uses hints from external sources. BMC Bioinformatics. 2006. 7:62
    90. Stanke M, Waack S. Gene prediction with a hidden Markov model and a new intron submodel. Bioinformatics. 2003. 19 Suppl 2:II215-II225
    91. The International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature. 2001. 409(6822):860-921
    92. Thomas A, Skolnick MH. A probabilistic model for detecting coding regions in DNA sequences. IMA J Math Appl Med Biol. 1994. 11(3):149-160
    93. Tolstrup N, Rouze P, Brunak S. A branch point consensus from Arabidopsis found by non-circular analysis allows for better prediction of acceptor sites. Nucleic Acids Res. 1997. 25(15):3159-3163
    94. Uberbacher EC, Mural RJ. Locating protein-coding regions in human DNA sequences by a multiple sensor-neural network approach. Proc Natl Acad Sci U S A. 1991. 88(24):11261-11265
    95. Usuka J, Brendel V. Gene structure prediction by spliced alignment of genomic DNA with protein sequences: increased accuracy by differential splice site scoring. J Mol Biol. 2000. 297(5):1075-1085
    96. Usuka J, Zhu W, Brendel V. Optimal spliced alignment of homologous cDNA to a genomic DNA template. Bioinformatics. 2000. 16(3):203-211
    97. Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, Smith HO, Yandell M, Evans CA, Holt RA et al. The sequence of the human genome. Science. 2001. 291(5507):1304-1351
    98. Viterbi A. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Trans Informat Theor. 1967. IT-13:260-269
    99. Wasserman WW, Sandelin A. Applied bioinformatics for the identification of regulatory elements. Nat Rev Genet. 2004. 5(4):276-287
    100. Wheelan SJ, Church DM, Ostell JM. Spidey: a tool for mRNA-to-genomic alignments. Genome Res. 2001. 11(11):1952-1957
    101. Wiehe T, Gebauer-Jung S, Mitchell-Olds T, Guigo R. SGP-1: prediction and validation of homologous genes based on sequence alignments. Genome Res. 2001. 11(9):1574-1583
    102. Wu TD. A segment-based dynamic programming algorithm for predicting gene structure. J Comput Biol. 1996. 3(3):375-394
    103. Xu Y, Mural RJ, Uberbacher EC. Constructing gene models from accurately predicted exons: an application of dynamic programming. Comput Appl Biosci. 1994. 10(6):613-623
    104. Xu Y, Uberbacher EC. Automated gene identification in large-scale genomic sequences. J Comput Biol. 1997. 4(3):325-338
    105. Zhang MQ. Identification of protein coding regions in the human genome by quadratic discriminant analysis. Proc Natl Acad Sci U S A. 1997. 94(2):565-568
    106. Zhang MQ. Computational prediction of eukaryotic protein-coding genes. Nat Rev Genet. 2002. 3(9):698-709
    107. Zhang MQ, Marr TG. A weight array method for splicing signal analysis. Comput Appl Biosci. 1993. 9(5):499-509
