Recognition of Protein-Coding Regions and Genome Analysis in Prokaryotes and Eukaryotes
Abstract
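With the rapid progress of the human, model-organism, and microbial genome projects, the complete genomes of nearly two hundred free-living organisms have been sequenced, and the number of bases in the three major international nucleotide sequence databases is growing exponentially. Once a genome has been sequenced, identifying its protein-coding genes is the first step of genome analysis and occupies a central place in bioinformatics research. This thesis focuses on recognizing protein-coding genes in prokaryotes, eukaryotes, and coronaviruses, and on genome analysis.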
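    The first part of the thesis introduces the background and main research topics of bioinformatics, the structural features of prokaryotic and eukaryotic genes, the principal algorithms for recognizing protein-coding genes, and the Z curve theory of DNA sequences and its applications. Because Z curve theory is the main tool used here to analyze prokaryotic and eukaryotic genomes, it is described in some detail.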
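    As a rough illustration of the Z curve representation mentioned above, the sketch below computes the three cumulative components of a DNA sequence using the definition commonly given in the Z curve literature: x_n contrasts purines with pyrimidines, y_n contrasts amino with keto bases, and z_n contrasts weak with strong hydrogen-bond bases. The function name `z_curve`, the `STEPS` table, and the treatment of ambiguous symbols are illustrative assumptions, not details taken from the thesis.

```python
# Per-base increments (dx, dy, dz) under the usual Z curve definition:
#   x_n = (A_n + G_n) - (C_n + T_n)   purine vs. pyrimidine
#   y_n = (A_n + C_n) - (G_n + T_n)   amino vs. keto
#   z_n = (A_n + T_n) - (G_n + C_n)   weak vs. strong hydrogen bonds
STEPS = {
    "A": (1, 1, 1),
    "C": (-1, 1, -1),
    "G": (1, -1, -1),
    "T": (-1, -1, 1),
}


def z_curve(seq):
    """Return three lists (x, y, z) holding the cumulative Z curve components."""
    x, y, z = [], [], []
    xn = yn = zn = 0
    for base in seq.upper():
        # Assumption: symbols other than A/C/G/T (e.g. N) leave the curve unchanged.
        dx, dy, dz = STEPS.get(base, (0, 0, 0))
        xn += dx
        yn += dy
        zn += dz
        x.append(xn)
        y.append(yn)
        z.append(zn)
    return x, y, z
```

    Calling `z_curve` on a sequence returns one point per base, so the three lists can be plotted directly against position.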
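    The second part covers gene recognition and analysis in prokaryotes and coronaviruses. First, we propose a method that trains its parameters on the well-annotated known genes of bacterial and archaeal genomes and then identifies, among the poorly annotated ORFs, those that probably do not encode proteins. On this basis we developed ZCURVE_C, a gene recognition program for bacteria and archaea, and made it available as a web service. We also found that genomic GC content matters more than phylogenetic relatedness for gene recognition in bacteria and archaea. Second, taking advantage of the small number of parameters the Z curve method requires, we developed ZCURVE_CoV, a gene recognition program designed for coronaviruses (SARS coronavirus in particular). Using position weight matrices to predict the cleavage sites of the 3C-like and papain-like proteinases, we then produced a new version that predicts the proteinase cleavage sites in coronavirus polyproteins.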
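    The position weight matrix idea can be sketched as follows. This is a minimal, generic log-odds PWM, not the ZCURVE_CoV implementation: the uniform background, the pseudocount, the function names, and the toy training windows in the usage note are all assumptions made for illustration.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pwm(sites, pseudocount=1.0):
    """Build a log-odds PWM from equal-length peptide windows around known sites.

    Assumption: a uniform background over the 20 standard amino acids.
    """
    background = 1.0 / len(AMINO_ACIDS)
    pwm = []
    for pos in range(len(sites[0])):
        counts = {aa: pseudocount for aa in AMINO_ACIDS}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append(
            {aa: math.log2(counts[aa] / total / background) for aa in AMINO_ACIDS}
        )
    return pwm


def score(pwm, window):
    """Sum the per-position log-odds scores of one window."""
    return sum(column[aa] for column, aa in zip(pwm, window))


def best_site(pwm, protein):
    """Return (score, start_index) of the highest-scoring window in protein."""
    width = len(pwm)
    candidates = [
        (score(pwm, protein[i:i + width]), i)
        for i in range(len(protein) - width + 1)
    ]
    return max(candidates)
```

    For example, with the made-up training windows `["AVLQSG", "TVLQAG", "SVLQSG", "ATLQAG"]`, `best_site(build_pwm(windows), "MKKAVLQSGDE")` picks the window starting at index 3. In practice a score threshold would be needed to decide which windows count as predicted sites; that choice is outside this sketch.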
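    The third part covers eukaryotic gene recognition and genome structure analysis. First, using the windowless technique based on the Z curve, we analyzed the isochore structure of the Arabidopsis thaliana genome and plotted the Z' curves of its five chromosomes. We examined in detail two isochores found on chromosome 2: one lies in the nucleolar organizer region, and the other is a mitochondrial DNA insertion whose size and chromosomal position we can determine precisely. Second, we developed Zcurve_E, an ab initio gene recognition program for eukaryotes based on the Z curve method. The program focuses on extracting global statistical features of protein-coding sequences at the three codon positions, and it has the advantages of few parameters and broad applicability. Using Zcurve_E together with Genscan, one of the best-performing gene finders currently available, can partly reduce Genscan's false-positive rate and yield better recognition results.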
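    The windowless Z' curve can be illustrated with a short sketch. It follows the form usually given in the published Z curve isochore papers, z'_n = z_n - k*n, where z_n is the cumulative (A+T) minus (G+C) count and k is the slope of a least-squares line fitted to z_n. Fitting the line through the origin, the function name `z_prime_curve`, and the handling of ambiguous bases are assumptions of this sketch rather than details confirmed by the thesis.

```python
def z_prime_curve(seq):
    """Return (z_prime, k) for a DNA sequence.

    z_n counts (A + T) - (G + C) over the first n bases. The slope k comes
    from a least-squares fit of z_n = k * n through the origin (an assumption
    of this sketch), and z'_n = z_n - k * n removes the overall trend.
    """
    z = []
    zn = 0
    for base in seq.upper():
        if base in "AT":
            zn += 1
        elif base in "GC":
            zn -= 1
        # Assumption: other symbols (e.g. N) leave z_n unchanged.
        z.append(zn)
    if not z:
        return [], 0.0
    positions = range(1, len(z) + 1)
    k = sum(n * v for n, v in zip(positions, z)) / sum(n * n for n in positions)
    return [v - k * n for n, v in zip(positions, z)], k
```

    In the resulting curve, a rising stretch indicates a region that is more AT-rich than the sequence as a whole and a falling stretch a region that is more GC-rich, so a sharp turning point marks a candidate boundary between compositionally distinct domains without choosing any window size.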
The accelerating pace of genome-sequencing projects for human and other model organisms has produced a large quantity of genome data, creating a great need for automatic genome annotation. One of the important tasks of annotation is to recognize protein-coding genes in prokaryotic and eukaryotic genomes. This thesis describes new approaches, based on the Z curve method, for recognizing protein-coding genes in bacterial, archaeal, coronavirus, and eukaryotic genomes.
    The first part of the thesis introduces the development of bioinformatics and the progress of computational gene-finding algorithms. It also presents the Z curve theory, which is the basic tool used in this thesis for analyzing prokaryotic and eukaryotic genomic sequences.
    The second part proposes algorithms for recognizing protein-coding genes in prokaryotic genomes. Because false-positive predictions are always present in the annotation of microbial genomes, it is essential to determine which ORFs are coding and which are not. Starting from the known genes in the annotation file, we describe a method based on Z curve theory for recognizing protein-coding genes among questionable ORFs. The average recognition accuracy over 57 bacterial and archaeal genomes is greater than 99%. A computer program, ZCURVE_C, has been developed, and a web service is provided. We also find that genomic GC content is more important than phylogenetic lineage for gene recognition in bacterial and archaeal genomes. Finally, a new program has been developed to recognize genes in coronavirus genomes; it is especially suitable for SARS-CoV genomes. The improved system, ZCURVE_CoV 2.0, can predict the cleavage sites of viral proteinases in coronavirus polyproteins.
    The third part analyzes the genome structure of Arabidopsis thaliana and develops an ab initio eukaryotic gene recognition program. Using a windowless technique based on the Z curve method, the isochore structure of the Arabidopsis thaliana genome has been explored. The position and size of a mitochondrial DNA insertion isochore have been precisely determined. Its amino acid usage and codon preference differ from those of genes in other regions. Furthermore, a new ab initio gene-finding program for eukaryotic organisms, Zcurve_E, is presented in this section. The new algorithm addresses global statistical features of protein-coding sequences by taking the frequencies of bases at the three codon positions into account. Consequently, it gives better consideration to both typical and atypical cases. Compared with other gene-finding software, the program has the merits of simplicity, universality, and reliability. Using Zcurve_E jointly with Genscan, which is probably the best software currently available for gene recognition in eukaryotic genomes, may give better results than either program alone.
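    To make the phrase "frequencies of bases at three codon positions" concrete, the sketch below turns a candidate coding sequence into nine numbers, three Z curve components for each codon position, in the form used in the published Z curve gene-finding papers. It illustrates the feature extraction only, not the Zcurve_E classifier; the function name `codon_position_features` and the assumption that the sequence starts in frame are ours.

```python
def codon_position_features(cds):
    """Return nine Z curve features for a sequence assumed to start in frame.

    For codon position i (1, 2, 3) with base frequencies a_i, c_i, g_i, t_i:
        x_i = (a_i + g_i) - (c_i + t_i)
        y_i = (a_i + c_i) - (g_i + t_i)
        z_i = (a_i + t_i) - (g_i + c_i)
    """
    cds = cds.upper()
    features = []
    for phase in range(3):
        bases = cds[phase::3]      # every third base, starting at this codon position
        total = len(bases) or 1    # avoid division by zero on very short input
        a, c, g, t = (bases.count(b) / total for b in "ACGT")
        features += [
            (a + g) - (c + t),
            (a + c) - (g + t),
            (a + t) - (g + c),
        ]
    return features
```

    A vector like this can be fed to any standard discriminant method to separate coding from non-coding sequences; the small, fixed number of features is what the abstract refers to as simplicity.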
