Geometric Characterization of Biological Sequences and Its Applications
Abstract
With the rapid progress and successive completion of genome projects for model organisms, and especially the completion of the Human Genome Project, biological data have accumulated at an unprecedented rate. Driven by this growth, bioinformatics has emerged as a new interdisciplinary field and has developed rapidly; it is becoming one of the core areas of the natural sciences in the 21st century. It uses mathematics, statistics, and computer science as its tools and takes nucleic acids, proteins, and other biological macromolecules as its main objects of study. It is concerned with collecting, storing, transmitting, retrieving, and analyzing these data in order to explore fundamental questions such as the origin of life, biological evolution, and the nature of life.
Bioinformatics covers a wide range of topics, including sequence comparison, phylogenetic analysis, gene prediction, protein structure prediction, drug design, biochemical simulation, whole-genome analysis, RNA structure prediction, sequence contig assembly, and public databases and data formats. This dissertation focuses on sequence comparison and molecular evolution analysis. The main results are as follows.
In Chapter 2, based on the idea of the chaos game representation (CGR), we give 2-D graphical representations of RNA secondary structure sequences and of protein sequences. These representations avoid some limitations of earlier graphical models of biological macromolecular sequences. We use them to analyze the similarity of different sequences and to construct a phylogenetic tree of protein sequences.
In Chapter 3, we smooth the curve representing a sequence with a cubic spline function and introduce the curvature of the smoothed curve into similarity analysis as a new measure. Taking the β-globin genes of 11 species and the coding sequence of each of their exons as examples, we analyze their similarity and construct phylogenetic trees. Examining the exons individually, we find that the second exon carries more biological information than the other two. The method is accurate and computationally simple.
In Chapter 4, to avoid the inaccuracy of the smoothed approximation used in Chapter 3, we propose a difference form of torsion and use it as a new descriptor to characterize the TOPS strings of protein sequences. We analyze the similarity of 34 TOPS strings and compare the results with those obtained with Clustal X; the agreement is good. This method is likewise accurate and computationally simple.
In Chapter 5, instead of considering a single characteristic of the curve, we combine its curvature and torsion into one new measure for analyzing the similarity of DNA sequences. We apply the method to the β-globin genes of 11 species and to each of their exons, with good results. We also use it to study the relationships among coronavirus genomes and to construct their phylogenetic tree. Finally, we compare the method with commonly used approaches based on matrix invariants on the same data; in both running time and numerical results, our method performs better, and its procedure is simpler and faster to compute.
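As an illustration of the curvature-and-torsion idea summarized for Chapters 4 and 5, the following minimal Python sketch computes finite-difference estimates of curvature and torsion along a 3-D curve built from a DNA sequence. Several parts are assumptions made for illustration only and are not taken from the dissertation: the Z-curve mapping (Zhang and Zhang, 1994) stands in for whatever graphical representation the author uses, the central-difference formulas are one possible difference form, and the names `z_curve`, `curvature_torsion`, `descriptor`, and `distance` are hypothetical.

```python
from math import sqrt

# Z-curve step for each base (Zhang & Zhang, 1994). This is a stand-in 3-D
# mapping chosen for illustration, not necessarily the dissertation's own.
Z_STEPS = {"A": (1, 1, 1), "C": (-1, 1, -1), "G": (1, -1, -1), "T": (-1, -1, 1)}


def z_curve(seq):
    """Return the cumulative Z-curve points of a DNA string, starting at the origin."""
    point = (0, 0, 0)
    points = [point]
    for base in seq.upper():
        point = tuple(p + s for p, s in zip(point, Z_STEPS[base]))
        points.append(point)
    return points


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def curvature_torsion(points):
    """Estimate curvature and torsion at interior points by central differences.

    Assumed difference forms (unit step in the point index):
      r'   ~ (P[i+1] - P[i-1]) / 2
      r''  ~  P[i+1] - 2 P[i] + P[i-1]
      r''' ~ (P[i+2] - 2 P[i+1] + 2 P[i-1] - P[i-2]) / 2
    Values are set to 0.0 where the usual formulas are undefined.
    """
    kappas, taus = [], []
    for i in range(2, len(points) - 2):
        pm2, pm1, p0, pp1, pp2 = points[i - 2:i + 3]
        d1 = tuple((a - b) / 2 for a, b in zip(pp1, pm1))
        d2 = tuple(a - 2 * b + c for a, b, c in zip(pp1, p0, pm1))
        d3 = tuple((a - 2 * b + 2 * c - d) / 2
                   for a, b, c, d in zip(pp2, pp1, pm1, pm2))
        c = cross(d1, d2)
        c2 = dot(c, c)      # |r' x r''|^2
        s2 = dot(d1, d1)    # |r'|^2
        kappas.append(sqrt(c2) / s2 ** 1.5 if s2 > 0 else 0.0)
        taus.append(dot(c, d3) / c2 if c2 > 0 else 0.0)
    return kappas, taus


def descriptor(seq):
    """Hypothetical summary descriptor: (mean curvature, mean torsion)."""
    kappas, taus = curvature_torsion(z_curve(seq))
    if not kappas:
        raise ValueError("sequence too short: need at least 4 bases")
    return (sum(kappas) / len(kappas), sum(taus) / len(taus))


def distance(seq_a, seq_b):
    """Euclidean distance between the descriptors of two sequences."""
    da, db = descriptor(seq_a), descriptor(seq_b)
    return sqrt(sum((x - y) ** 2 for x, y in zip(da, db)))
```

A pairwise matrix of `distance` values over a set of sequences could then be passed to a standard tree-building method such as neighbor joining. Averaging the two quantities is only the simplest possible way to combine them; the dissertation's actual combined measure may differ.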
