生物序列分析中的非比对方法及其应用
详细信息    本馆镜像全文|  推荐本文 |  |   获取CNKI官网全文
摘要
随着数学与计算机技术的飞速发展和巨量生物学数据的不断积累,一门新兴的充满活力的交叉学科——计算分子生物学(Computational Molecular Biology)应运而生。计算分子生物学主要是研究生物学应用上具有计算复杂度的问题,它吸引了许多计算机学家、分子生物学家、数学家等积极投入研究。生物序列分析是计算分子生物学研究的核心内容,传统的分析方法主要是以序列比对方法为主,而随着“后基因组(post-genome)”时代的到来,生物序列分析的非比对方法作为对传统方法的补充和发展已逐渐成为计算分子生物学研究中的一个热点领域。本文在对传统的序列比对方法进行简要回顾的基础上,较系统地总结了已有的非比对方法并提出了一些新的非比对方法,然后针对一些具体的生物序列进行了分析研究。本文的主要工作包括以下几个方面:
     基于生物序列的概率向量表示,提出了一种新的距离度量——正规化欧氏距离,重构了两组蛋白质序列集CK35和SP86的二级结构分类,并利用ROC曲线和AUC值与传统的比对方法和其它距离度量得到的分类结果进行了比较。
     以生物序列L-联体为核心,给出了DNA序列的一种8D向量表示和高维向量表示,并根据滑动窗口不同的起始位置构造相关矩阵,选取相关矩阵的正规化最大特征值和Frobenius范数作为数值特征比较序列的相似性。作为应用,我们比较了十一个物种的β-球蛋白基因的第一个外显子的相似性;简单模拟了DNA序列高维向量表示及相关矩阵在数据库搜索方面的应用;重构了H5N1型禽流感病毒全基因组编码序列的种系进化树。
     基于L-联体在生物序列中出现的次数和位置,根据离散随机变量分布函数的定义提出了L-联体特征分布的概念,以此来反映L-联体的分布规律,揭示生物序列中所包含的生物信息。利用此特征分布我们研究了11个物种β-球蛋白第一个外显子的GC特征分布图;重构了24种冠状病毒全基因组序列,34种哺乳动物线粒体全基因组序列和40种跨膜蛋白序列的种系树。
With the rapid development of the mathematics and computer technologies and the continuous accumulation of the tremendous biological data, a new and active interdiscipline—Computational Molecular Biology comes into being. The research in computational molecular biology which has attracted plenty of computer scientists, molecular biologists, mathematicians and so on to devote to it, is mainly concerned with the problems involving the computational complex in the biological applications. Biological sequence analysis is the key content of the interdiscipline and the traditional methods for the analysis are chiefly based on alignment of the strings, while with the coming of the " post-genome" era, alignment-free methods of the sequence analysis as the complement and development of the alignment methods have become a hot research area of computational molecular biology. In this dissertation, we firstly simply review the alignment methods; secondly relatively systematically summarize the alignment-free methods and propose some new alignment-free methods; finally make the analysis for some species sequences using the novel methods. The main contents of this dissertation are listed as follows:
     Based on the vectors of L-tuple probabilities for biological sequences, we provide a novel distance measure-normalized Euclidean distance, and classify two sets of protein sequences-CK35 and SP86 according to protein secondary structures using the distance function. Further, we compare our method with other metrics and alignment methods via ROC (Receiver Operating Curve) analysis in order to assess the intrinsic ability of the methodology to discriminate and classify biological sequences and structures.
     Using L-tuples, we consider to construct three 8-components vectors and multivariate vectors for a DNA primary sequence, and by the different start positions of the sliding window, a set of related matrices are given. The normalized leading eigenvalues and Frobenius norm from the constructed matrices have been selected as the numerical characterizations. As applications, we compare the similarity and dissimilarity for exon 1 ofβ-globin genes belonging to eleven species; we simulate the search for similar sequences of a query sequence from a database of 39 library sequences by the multivariate vectors representations of DNA sequence; we reconstruct the phylogenetic trees of H5N1 avian influenza virus genomes.
     From the frequency and position of appearance of L-tuple in a biological sequence, we consider construction of a characteristic distribution of an L-tuple to reflect the biological information involved in the sequence. The graphs of characteristic distributions of dinucleotide GC for the coding sequences of the first exon ofβ-globin gene of eleven different species, and the construction of phylogenetic trees of twenty four coronavirus genomes, thirty four mitochondrial genomes and 40 G protein-coupled receptors illustrate the utility of the approach.
引文
[1]NCBI-GenBank Flat File Release 167.0.http://www.ncbi.nlm.nih.gov/Genbank/.
    [2]J.塞图宝,J.梅丹尼斯 著,朱浩等译.计算分子生物学导论.北京:科学出版社,2003.
    [3]Jiang T,Xu Y,Zhang M Q.Current topics computation molecular biology.北京:清华大学出版社,2002.
    [4]陈润生.生物信息学.生物物理学报,1999,15(1):5-12.
    [5]张春霆.生物信息学的现状与展望.世界科技研究与发展,2000,22:17-20.
    [6]陈润生.与生物信息学相关的两个前沿方向——非编码基因和复杂生物网.生物物理学报,2007,23(4):290-295.
    [7]程妍,刘仲林.计算生物学——一门充满活力的新兴交叉学科.科学学与科学技术管理,2006,3:11-15.
    [8]皮埃尔.巴尔迪,索恩.布鲁纳克著,张东晖等译.生物信息学——机器学习方法.北京:中信出版社,2003.
    [9]Lewin B.Genes Ⅶ.Oxford University Press,1999.
    [10]Deonier R C,Tavare S,Waterman M S.Computational Genome Analysis:An Introduction.Springer Verlag,New York,2005.
    [11]Needleman S B,Wunsch C D.A general method applicable to the search for similarities in the amino acid sequence of two proteins.J.Mol.Biol.,1970,48:443-453.
    [12]Smith T F,Waterman M S.Identification of common molecular subsequences.J.Mol.Biol.,1981,147:195-197.
    [13]Pearson W R,Lipman D J.Rapid and sensitive protein similarity searches.Science,1985,227:1435-1441.
    [14]Altschul S F,Gish W,Miller W,Myers E W,Lipman D J.Basic local alignment search tool.J.Mol.Biol.,1990,215:403-410.
    [15]Thompson J D,Higgins D,Gibson T.CLUSTAL W:improving the sensitivity of progressive multiple sequence alignment through sequence weighting,positions-specific gap penalties and weight matrix choice.Nucleic Acids Research,1994,22:4673-4680.
    [16]北京大学生物信息中心.多序列比对.http://www.cbi.pku.edu.cn/chinese/documents/bioinfor/overview/web4/1.html.
    [17]Dayhoff M,Schwartz R,Orcutt B.A model of evolutionary change in proteins.In Dayhoff,M.D.,ed.,Atlas of Protein Sequence and Structure,National Biomedical Research Foundation,Washington D.C.,1978,5(supplement 3):345-352.
    [18] Henikoff S, Henikoff J. Amino acid substition matrices from protein block. Proc. Natl. Acad. Sci. USA, 1992, 89:10915-10919.
    [19] Waterman M S. Introduction to Computational Biology: Maps, Sequences and Genomes. Chapman and Hall Press, London, 1995.
    [20] Edgar R. Muscle: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics, 2004, 5:113.
    [21] Wallace I M, Blackshields G, Higgins D G. Multiple sequence alignments. Curr. Opin. Struc. Biol, 2005, 15:261-266.
    [22] Vinga S, Almeida J. Alignment-free sequence comparison-a review. Bioinformatics, 2003, 19:513-523.
    [23] Hamori E, Ruskin J. H curves, a novel method of representation of nucleotide series especially suited for long DNA sequences. J. Biol. Chem., 1983, 258:1318-1327.
    [24] Hamori E. Novel DNA sequence representations. Nature, 1985, 314:585-586.
    [25] Jeffrey H J. Chaos game representation of gene structure. Nucleic Acids Res., 1990, 18:2163-2170.
    [26] Jeffrey H J. Chaos game visualization of sequences. Comput. Graph., 1992, 16:25-33.
    [27] Peng C K, Buldyrev S V, Glodberger A L, Havlin S, Sciortino F, Simons M, Stanley H E. Long-range correlations in nucleotide sequences. Nature, 1992, 356:168-170.
    [28] Zhang R, Zhang C T. Z curves, an intutive tool for visualizing and analyzing the DNA sequences. J. Biomol. Struct. Dyn., 1994, 11:767-782.
    [29] Zhang C T, Zhang R. S Curve, A graphic representation of protein secondary structure sequence and its applications. Biopolymers, 2000, 53:539-549.
    [30] Zhang C T, Zhang R, Ou H. The Z curvedatabase: a graphic representation of genome sequences. Bioinformatics, 2003, 19:593-599.
    [31] Gates M A. A simple way to look at DNA. J. Theor. Biol., 1986, 119:319-328.
    [32] Leong P M, Mogenthaler S. Random walk and gap plots of DNA sequences. Comput. Appl. Biosci., 1995, 11:503-507.
    [33] Nandy A. A new graphical representation and analysis of DNA sequence structure: I. Methodology and application to globin genes. Curr. Sci., 1994, 66:309-314.
    [34] Nandy A. Two-dimensional graphical representation of DNA sequences and intron-exon discrimination in intron-rich sequences. Comput. Appl. Biosci., 1996, 12:55-62.
    [35] Roy A, Raychaudhury C, Nandy A. Novel techniques of graphical representation and analysis of DNA sequences - A review. J. Biosci., 1998. 23:55-71.
    [36] Randic M, Vracko M, Nandy A, Basak S C. On 3-D Graphical Representation of DNA Primary Sequences and Their Numerical Characterization. J. Chem. Inf. Comput. Sci., 2000, 40:1235-1244.
    [37] Randic M, Vracko M, Lers N, Plavsic D. Analysis of similarity/dissimilarity of DNA sequences based on a novel 2-D graphical representation.Chem.Phys.Lett.,2003,371:202-207.
    [38]Yau S S,Wang J,Niknejad A,Lu C,Jin N,Ho Y K.DNA sequence representation without degeneracy.Nucleic Acids Res.,2003,31:3078-3080.
    [39]Yuan C,Liao B,Wang T.New 3D graphical representation of DNA sequences and their numerical characterization.Chem.Phys.Lett.,2003,379:412-417.
    [40]Li C,Wang J.On a 3-D representation of DNA primary sequences.Comb.Chem.High T.Scr.,2004,7:23-27.
    [41]Liao B.A 2D graphical representation of DNA sequence.Chem.Phys.Lett.,2005,401:196-199.
    [42]Bai F L,Wang T M.On graphical and numerical representation of protein sequences.J.Biomol.Struct.Dyn.,2006,23:537-545.
    [43]Randic M.2-D Graphical representation of proteins based on physico-chemical properties of amino acids.Chem,Phys.Lett.,2007,440:291-295.
    [44]Liu N,Wang T M.Graphical representations for protein secondary structure sequences and their application.Chem.Phys.Lett.,2007,435:127-131.
    [45]Deschavanne P,Giron A,et al.Genomic signature:characterization and classification of species assessed by chaos game representation of sequences.Mol.Biol.Evol.,1999,16:1391-1399.
    [46]Almeida J,Carric J,Maretzek A,Noble P A,Fletcher M.Analysis of genomic sequences by Chaos Game Representation.Bioinformatics,2001,17:429-437.
    [47]郑捷.SVM几何修正法及其在DNA序列处理中的应用:博士学位论文.北京工业大学,2001,31-33.
    [48]张春霆.人与其他生物基因组若干重要问题的生物信息学研究.自然科学进展,2004,14:1367-1374.
    [49]Grigoriev A.Analyzing genomes with cumulative skew diagrams.Nucleic Acids Res.,1998,26:2286-2290.
    [50]Raychaudhury C,Nandy A.Indexing scheme and similarity measures for macromolecular sequences.J.Chem.Inf.Comput.Sci.,1999,39:243-247.
    [51]Guo Y,Wang T M.A new method to analyze the similarity of the DNA sequences.J.Mole.Struct.THEOCHEM,2008,853:62-67.
    [52]Randic M,Vracko M.On the similarity of DNA primary sequences.J.Chem.Inf.Comput.Sci.,2000,40:599-606.
    [53]Randic M,Vracko M,Lets N,Plavsic D.Novel 2-D graphical representation of DNA sequences and their numerical characterization.Chem.Phys.Lett.,2003,368:1-6.
    [54]Randic M,Balaban A T.On a four-dimensional representation of DNA primary sequences.J.Chem.Inf.Comput.Sci.,2003,43:532-539.
    [55]Li C,Wang J.New invariant of DNA sequences.J.Chem.Inf.Model.,2005,45:115-120.
    [56]Berger J,Mitra S,Carli M,Neri A.Visualization and analysis of DNA sequences using DNA walks.J.Franklin I.,2004,341:37-53.
    [57]Vinga S,Gouveia-Oliveira R,Almeida J.Comparative evaluation of word composition distances for the recognition of SCOP relationships.Bioinformatics,2004,20:206-215.
    [58]Blaisdell B E.A measure of the similarity of sets of sequences not requiring sequence alignment.Proc.Natl.Acad.Sci.USA,1986,83:5155-5159.
    [59]Burke J,Davison D,Hide W.d2 cluster:a validated method for clustering EST and full-length cDNA sequences.Genome Res.,1999,9:1135-1142.
    [60]Wu T J,Hsieh Y C,Li L A.Statistical Measure of DNA Sequence Dissimilarity under Markov Chain Models of Base Composition.Biometrics,2001,57:441-448.
    [61]Petrilli P.Classification of protein sequences by their dipeptide composition.Comput.Appl.Biosci.,1993,9:205-209.
    [62]Pham T D,Zuegg J.A probabilistic measure for alignment-free sequence comparison.Bioinformatics,2004,20:3455-3461.
    [63]Kantorovitz M,Robinson G E,Sinha S.A statistical method for alignment-free comparison of regulatory sequences.Bioinformatics,2007,23:i249-i255.
    [64]Ferragina P,Giancarlo R,Greco V,Manzini G,Valiente G.Compression-based classification of biological sequences and structures via the Universal Similarity Metric:experimental assessment.BMC Bioinformatics,2007,8:252.
    [65]Randic M.Condensed Representation of DNA Primary Sequences.J.Chem.Inf.Comput.Sci.,2000,40:50-56.
    [66]Randic M,Guo X F,Basak S C.On the Characterization of DNA Primary Sequences by Triplet of Nucleic Acid Bases.J.Chem.Inf.Comput.Sci.,2001,41:619-626.
    [67]Feng J,Liu X,Wang T.Condensed representations of protein secondary structure sequences and their application.J.Biomol.Struct.Dyn.,2008,25:621-628.
    [68]He P,Wang J.Characteristic sequences for DNA primary sequence.J.Chem.Inf.Comput.Sci.,2002,42:1080-1085.
    [69]Dai Q,Liu X,Wang T.Numerical characterization of DNA sequences based on the k-step Markov chain transition probability.J.Comput.Chem.,2006,27:1830-1842.
    [70]Stuart G W,Moffett K,Leader J J.A comprehensive vertebrate phylogeny using vector representations of protein sequences from whole genomes.Mol.Biol.Evol.,2002,19:554-562.
    [71]Qi J,Wang B,Hao B I.Whole proteome prokaryote phylogeny without sequence alignment:A K-string composition approach.J.Mol.Biol.,2004,58:1-11.
    [72]蔡旭,方伟武,张文.基于线粒体全基因组的非比对方法比较.计算机与应用化学,2005,22(10):837-844.
    [73] Blaisdell B E. Average values of a dissimilarity measure not reqmring sequence alignment are twice the averages of conventional mismatch counts requiring sequence alignment for a computer-generated model system. J. Mol. Evol., 1989, 29:538-547.
    [74] Stuart G W, Moffett K, Baker S. Integrated gene and species phylogenies from un-aligned whole genome protein sequences. Bioinformatics, 2002, 18:100-108.
    [75] Torney D, Burks C, Davison D, Sirotkin K. Computation of d2: a measure of sequence dissimilarity. In George, I. and Bell,T.G.M. (eds), Computers and DNA: the proceedings of the Interface between Computation Science and Nucleic Acid Sequencing Workshop, held December 12 to 16, 1988 in Santa Fe, New Mexico. Addison-Wesley, Redwood City, CA, 1990, pp.109-125.
    [76] Wu T J, Burke J P, Davison D B. A measure of DNA Sequence Dissimilarity Based on Mahalanobis Distance between Frequencies of Words. Biometrics, 1997, 53:1431-1439.
    [77] Fang W W. The disagreement degree of multi-person judgments in additive structure. Math. Soc. Sci., 1994, 28:85-111.
    [78] Dai Q, Wang T. Comparison study on k-word statistical measures for protein: From sequence to 'sequence space'. BMC Bioinformatics, 2008, 9:394.
    [79] Chew L, Kedem K. Finding the consensus shape for a protein family. Algorithmica, 2003, 38:115-129.
    [80] Sierk M, Person W. Sensitivity and selectivity in protein structure comparison. Protein. Sci., 2004, 13:773-785.
    [81] Bradley A. The use of the area under the roc curve in the evaluation of machine learning algorithms. Pattern Recog., 1997, 30:1145-1159.
    [82] Wolski W. pairseqsim: Pairwise sequence alignment and scoring algorithms for global, local and overlap alignment with affine gap penalty. http://www.bioconductor.org, 2007.
    [83] Karlin S, Burge C. Dinucleotide relative abundance extremes: A genomic signature. Trends. Genet., 1995, 11:283-290.
    [84] Zhang C T, Wang J, Zhang R. A novel method to calculate the G+C content of genomic DNA sequences. J. Biomol. Struct. Dyn., 2001, 19:333-341.
    [85] Randic M, Zupan J, Balaban A T. Unique graphical representation of protein sequences based on nucleotide triplet codons. Chem. Phys. Lett., 2004, 397:247-252.
    [86] Balaban A T, Plavsic D, Randic M. Dna invarants based on nonoverlapping triplets of nucleotide bases. Chem. Phys. Lett., 2003, 379:147-154.
    [87] Liao B, Wang T M. Analysis of similarity/dissimilarity of DNA sequences based on nonoverlapping triplets of nucleotide bases. J. Chem. Inf. Comput. Sci., 2004, 44:1666-1670.
    [88] Wilhelm T, Nikolajewa S A. A new classification scheme of the genetic code. J.Mol.Evol., 2004, 59:598.
    [89]Bytautas L,Klein D J,Randic M,et al.DIMACS Series in Discrete Mathematics and Theoretical Computer Science,2000,51:39.
    [90]Gantmacher F.Chap.13 of Theory of Matrices,vol.Ⅱ.Chelsea Pub.,New York,1959.
    [91]Horn R,Johnson C.Chap.5 of Matrix Analysis.Cambridge University Press,1986.
    [92]Baldauf S L.Phylogeny for the faint of heart:a tutorial.Trends Genet.,2003,19(6):345-351.
    [93]常青,周开亚.分子进化研究中系统发生树的重建.生物多样性,1998,6(1):55-62.
    [94]Zuckerkandl E,Pauling L.Evolutionary divergence and convergence in proteins.In:Bryson V,Vogel H J(eds),Evolving genes and proteins.Academic Press,New York,1965,pp.97-166.
    [95]Zheng W X,Chen L L,Ou H Y,Gao F,Zhang C T.Coronavirus phylogeny based on a geometric approach.Mol.Phylogenet.Evol.,2005,36:224-232.
    [96]Sneath P,Sokal R.Numerical Taxonomy-the principles and practice of numerical classification.San Francisco:W.H.Freeman and Company,1973.
    [97]Saitou N,Nei M.The neighbor-joining method:a new method for reconstructing phylogenetic trees.Mol.Biol.Evol.,1987,4:406-425.
    [98]Durbin R,Eddy S R,Krogh A,Mitchison G.Biological Sequence Analysis:Probabilistic Models of Proteins and Nucleic Acids.Cambridge:Cambridge University Press.,1999.
    [99]Barry D,Hartigan J A.Statistical analysis of hominoid molecular evolution.Stat.Sci.,1987,2:191-210.
    [100]Felsenstein J.Evolutionary trees from DNA sequences:a maximum likelihood approach.J.Mol.Evol.,1981,17:368-376.
    [101]Camin J,Sokal R.A method for deducing branching sequences in phylogeny.Evolution,1965,19:311-326.
    [102]Huelsenbeck J P,Ronquist F.MRBAYES:Bayesian inference of phylogenetic trees.Bioinformatics,2001,17:754-755.
    [103]Li M,Badger J H,Chen X,Kwong S,Kearney P,Zhang H.An information-based sequence distance and its application to whole mitochondrial genome phylogeny.Bioinformatics,2001,17:149-154.
    [104]Otu H H,Sayood K.A new sequence distance measure for phylogenetic tree construction.Bioinformatics,2003,19:2122-2130.
    [105]高雷,戚继,孙健冬,郝柏林.原核生物系统发生学与分类学的一致性:组份矢量树与原核生物分类系统的详尽比.中国科学C辑,2007,37(4):389-401.
    [106]Nandy A,Basak S C,Gute B.Graphical representation and numerical characterization of H5N1 avian flu neuraminidase gene sequence.J.Chem.Inf.Model.,2007,47:945-951.
    [107]Liao B,Shan X,Zhu W,Li R.Phylogenetic tree construction based on 2D graphical representation.Chem.Phys.Lett.,2006,422:282-288.
    [108]Wan X F,Ren K J Tand Luo,Liao M,Zhang G H,et al.Genetic characterization of H5N1 avian influenza viruses isolated in southern china during the 2003-04 avian influenza outbreaks.Arch.Virol.,2005,150:1257-1266.
    [109]Felsenstein J.Phylip(phylogeny inference package),version 3.5c.1993.distributed by the author at http://evolution.genetics.washington.edu/phylip.html.
    [110]Hide W,Burke J,Davison D.Biological evaluation of d~2,an algorithm for high performance sequence comparison.J.Comp.Biol.,1994,1:199-215.
    [111]Rota P A,Oberste M,Monroe S,Nix W A,Campagnoli R,et al.Characterization of a.novel coronavirus associated with severe acute respiratory syndrome.Science,2003,300:1394-1399.
    [112]Gu W,Zhou T,Ma J,Sun X,Lu Z.Analysis of synonymous codon usage in SARS coronavirus and other viruses in the Nidovirales.Virus Res.,2004,101:155-161.
    [113]高雷,戚继,卫海滨,孙奕钢,郝柏林.冠状病毒和人类sars病毒的分子亲缘关系研究.科学通报,2003,48(12):1246-1250.
    [114]Page R D M.TreeView:an application to display phylogenetic trees on personal computers.Comp.Appl.Biosci.,1996,12:357-358.
    [115]Cao Y,Okada N,Hasegawa M.Phylogenetic position of guinea pigs revisited.Mol.Biol.Evol.,1997,14:461-464.
    [116]Liu N,Wang T.Protein-based phylogenetic analysis by using hydropathy profile of amino acids.FEBS Lett.,2006,580:5321-5327.
    [117]Horn F,Weare J,Beukers M,et al.GPCRDB:an information system for g proteincoupled receptors.Nucleic Acids Res.,1998,26:277-281.

© 2004-2018 中国地质图书馆版权所有 京ICP备05064691号 京公网安备11010802017129号

地址:北京市海淀区学院路29号 邮编:100083

电话:办公室:(+86 10)66554848;文献借阅、咨询服务、科技查新:66554700