基于树模型的RNA序列结构比对算法研究

英文题名：The Research on RNA Sequence-structure Alignment Algorithm Based on Tree Model
作者：陆斌
论文级别：硕士
学科专业名称：计算机科学与技术
中文关键词：RNA序列结构比对 ; 树模型 ; 动态规划 ; T-coffee
英文关键词：RNA sequence-structure alignment ; tree model ; dynamic programming ; T-coffee
学位年度：2009
导师：骆志刚
学科代码：081202
学位授予单位：国防科学技术大学
论文提交日期：2009-11-01

摘要

RNA序列结构比对是生物信息学的基础研究内容之一。通过对RNA序列和结构进行相似度比较,人们可以发现RNA序列中蕴含的功能和进化信息,对RNA序列分类、二级结构预测、发现序列保守区域都具有极其重要的研究意义。
     本文首先介绍了RNA序列结构的基本知识,给出了RNA序列结构比对问题的详细描述,分析了已有的RNA序列结构比对算法,对现有算法进行了分析和比较,并指出了当前比对算法存在的主要问题。接着详细阐述了树模型的构造和操作,给出了RNA序列结构与树模型的对应关系。
     针对目前低相似度RNA序列比对结果准确度不高的问题,本文基于RNA树形结构模型,提出了一种基于动态规划思想的RNA双序列结构比对算法,对算法进行了设计和实现。通过比较与其他比对算法在同一数据集上的运行结果,本文算法在低相似度RNA序列比对上表现出较高的准确度。在RNA双序列结构比对的基础上,本文结合T-coffee算法思想,设计并实现了RNA多序列结构比对算法,通过数值实验讨论了RNA多序列比对结果与序列数目及序列平均相似度的关系,同时验证了本文算法的有效性。
RNA sequence-structure alignment is one of the basic research contents in bioinformatics. Through the similarity comparison of RNA sequences and structures, people are able to find the functional and evolutional information hiding in RNA sequences. This has vital researching significance in the classification of RNA sequences, the prediction of RNA secondary structures and finding conserved regions in sequences.
     The paper first introduces the basic knowledge of the RNA sequence and structure, gives the detailed description of RNA sequence-structure alignment problems, analysis the existing RNA sequence-structure algorithms, and then gives analysis and comparison of present algorithms and points out the main problems of the present algorithms, describes the constructing and operating on the tree model in detail, give the relation between RNA sequence-structure and tree model.
     Considering the unsatisfied accuracy of current sequence-structure alignment algorithms on low similar RNA sequences, this paper presents a pairwise RNA sequence-structure alignment algorithm based on dynamic programming using the tree model, design and implement the algorithm. The comparison with other algorithms on the same data sets shows that the algorithm of this paper can give a higher accuracy on the low similar RNA sequences. Furthermore, Based on the ideas of T-coffee, the pair-wise RNA sequence-structure algorithm of this paper is extended to a multiple RNA sequence-structure alignment algorithm. Then, the relation between the results of the multiple RNA sequence-structure alignment and the number of sequences, average sequence similarity is discussed and the validity of our algorithm is verified.

引文

[1]Attwood T.K., Parry-Smith D.J.,罗静初等译.生物信息学概论.北京：北京大学出版社.2002.9：1～9.
    [2]Dan E. Krane, Michael L. Raymer著,孙啸,陆祖宏,谢建明等译.生物信息学概论.北京：清华大学出版社,2004.2：2～10.
    [3]肖筱.核酸序列分析方法的研究与实现.四川师范大学硕士学位论文.2009.4.
    [4]Bauer M., Klau G.W., Reinert K.. Accurate multiple sequence-structure alignment of RNA sequences using combinatorial optimization.BMC Bioinformatics.2007,8(271):32～66.
    [5]Althaus E., Caprara A., Lenhof H.P.. Multiple sequence alignment with arbitrary gap costs:Computing an optimal solution using polyhedral combinatorics .Bioinformatics.2002,18(2):4～16.
    [6]Davydov E., Batzoglou E.. A computational model for RNA multiple structural alignment. Theoretical Computer Science,2006,368:205～216.
    [7]Fertin G., Hermelin D., Rizzi R., Vialette S.. Common Structured Patterns in Linear Graphs:Approximation and Combinatorics. In:CPM 2007 LNCS 4580. Berlin: Springer-Verlag 2007,241～252.
    [8]Kubica M., Rizzi R., Vialette S. et al. Approximation of RNA multiple structure alignment. In:CPM2006,LNCS 2009.2006:211～222.
    [9]Herrbach C., Denise A., Dulucq S. et al. Alignment of RNA secondary structure using a full set of Operations.LRI Research Report 1451.2006.
    [10]Zhang K., Shasha D.. Simple fast algorithms for the editing distance between trees and related problems.SIAM J comput.1989,18(6):1245～1262.
    [11]Klein P.N.. Computing the edit-distance between unrooted ordered trees.In: European symposium on algorithms. LNCS 1461. Berlin:Springer-Verlag. 1998,91～102.
    [12]Zhang K., Jiang T.. Some MAX SNP-hard results concerning unordered labeled trees. Information Processing Letters.1994,49:249～254.
    [13]Zhang K.. A constrained edit distance between unordered labeled trees. Algorithmica.1996 15:205～222.
    [14]Jiang T., Wang L., Zhang K.. Alignment of trees-an alternative to tree edit.Theoretical computer Science.1995.143:137～148.
    [15]Hochsmann M., Toller T., Giegerich R..et.al. Local Similarity in RNA secondary structures.In:Proceedings of the Computational Systems Bioinformatics.2003:159～168.
    [16]Backofen R., Hermelin D., Landau G.M. et.al. Local alignment of RNA sequences with arbitrary scoring schemes.In:CPM 2006,LNCS 4009.Berlin:Springer-Verlag. 2006:246～257.
    [17]Lin G.H., Ma B., Zhang K.. Edit distance between two RNA structures. In: Proceedings of the 5th annual international conference on computational molecular biology.ACM Press.2001:211～220.
    [18]Jiang T., Lin G., Ma B..et al. A general edit distance between RNA structures. Journal of Computational Biology.2002,9(2):371～388.
    [19]Siebert S., Backofen R.. MARNA:multiple alignment and consensus structure prediction of RNAs based on sequence structure comparison. Bioinformatics.2005, 21(16):3352～3359.
    [20]Evans P.A.. Finding common subsequences with arcs and pseudoknots.In:CPM 99,LNCS 1645. Berlin:Springer-Verlag.1999,270～280.
    [21]Jiang T., Lin G., MA B. et al. The longest common subsequence problem for arc-annotated sequences. Journal of discrete algorithms.2004,2:257～270.
    [22]Yoon B.J., Vaidyanathan PP.HMM with auxiliary memory:a new tool for modeling RNA secondary Structures. In:Proc 28th Asilomar Conference on Signals, Systems and computers.2004.
    [23]Sakakibara Y.. Grammatical inference in Bioinformatics. IEEE transactions on pattern analysis and machine intelligence.2005,27(7):1051～1062.
    [24]Sakakibara Y. Pair hidden Markov models on tree structures. Bioinformatics.2003, 19(1):232～240.
    [25]Holmes I., Rubin G.M.. Pairwise RNA structure comparison with stochastic context-free Grammars. In:Proceeding of 5th Pacific Symposium on Biocomputing. 2002,7:163～174.
    [26]Eddy S.R..A memory-efficient dynamic programming algorithm for optimal alignment of a Sequence to an RNA secondary structure. BMC Bioinformatics.2002,3(18):5～15.
    [27]Sankoff D.. Simultaneous solution of the RNA folding alignment and protosequence problems. SIAM J Appl Math.1985,30:2076-2082.
    [28]Mathews D.H., Turner D.H.. Dynalign:An algorithm for finding the secondary structure common to two RNA sequences. J.Mol.Biol.2002,317:191～203.
    [29]Havgaard J.H., Torarinsson E., Gorodkin J.. Fast pairwise structural RNA alignments by pruning of the dynamical programming matrix. PLoS computational Biology.2007,3(10):1896～1908.
    [30]Hofacker I.L., Bernhart S.H.F., Stadler P.F.. Alignment of RNA base pairing probability matrices. Bioinformatics.2004,20:2222-2227.
    [31]Torarinsson E., Havgaard J.H., Gorodkin J.. Multiple structural alignment and clustering of RNA Sequences. Bioinformatics.2007,23(8):926～932.
    [32]Evans P.A.. Finding common RNA pseudoknot structures in polynomial time.In: CPM 2006.LNCS 4009. Berlin:Springer-Verlag.2006:223～232.
    [33]Rolf Backofen, SvenSiebert. Fast detection of common sequence structure patterns in RNAs. Journal of Discrete Algorithm.2007(5):212～228.
    [34]Zhuozhi Wang, Kaizhong Zhang. Alignment between Two RNA Structures.In: CPM 99, LNCS 1645. Berlin:Springer-Verlag.2001:690～703.
    [35]Thompson J.D., Plewniak F., Pach O.. BAliBASE:a benchmark alignment database for the evolution of multiple alignment programs.Bioinformatics.1999,15(1):87～88.
    [36]Notredame C, Holm L, Higgins G. COFFEE:an objective function for multiple sequence alignment. Bioinformatics.1998,14(5),407～422.
    [37]Robert J. Klein, Sean R. Eddy. RSEARCH:Finding homologs of single structured RNA sequences. BMC Bioinformatics.2003.4.
    [38]Rolf Backofen, Danny Hermelin, Gad M. Landau,OrenWeimann.Local Alignment of RNA Sequences with Arbitrary Scoring Schemes.In:CPM 2006, LNCS 4009. Berlin:Springer-Verlag.2006,246-257.
    [39]Serge Dulucq, Laurent Tichit. RNA secondary structure comparison:exact analysis of the Zhang-Shasha tree edit algorithm.Theoretical Computer Science. 2003(306),471～484.
    [40]Zhang K.. A Constrained Edit Distance Between Unordered Labeled Trees. Algorithmica 1996(15):205～222.
    [41]Philip Bille. Tree Edit Distance, Alignment Distance and Inclusion. IT University Technical Report Series.2003.5-6.
    [42]唐焕文,秦学志.实用最优化方法.大连：大连理工大学出版社.2004：304～312.
    [43]Needleman S.B., Wunsch CD.. A general method applicable to the search for similarity in the amino acid sequences of two proteins. J Mol Biol.1970,48:443～453.
    [44]Smith F.F., Waterman M.. Identification of common molecular subsequences.J Mol Biol.1981(147):195～197.
    [45]Hofacker I.L.. Vienna RNA secondary structure server. Nucleic Acids Res. 2003,31:3429～3431.
    [46]BRAliBase 2.1. http://www.biophys.uni-duesseldorf.de/bralibase.
    [47]Hogeweg P.,Hesper B.. The alignment of sets of sequences and the construction of phylogenetic trees, an intergrated method. J Mol Evol.1984,20:175～186.
    [48]Feng D.F.,Doolittle R. F.. Progressive sequence alignmet as a preprequisite to correct phylogenetic tress. J. Mol. Biol 1987,25(4):351～360.
    [49]Thompson J., Higgins D., Gibson T.. ClustalW:improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucl.Acids.Res. 1994,22:4673～4690.
    [50]Notredame C., Higgins D.,Herings J.. T-coffee:A novel method for fast and accurate multiple sequence alignment. J Mol Biol.2000(302):205～207.
    [51]张敏.生物信息学中多序列比对等算法的研究.大连理工大学博士学位论文.2005.第53页

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700