RNA序列比对算法研究

英文题名：Research on the Algorithms of RNA Sequence Alignment
作者：方曾营
论文级别：硕士
学科专业名称：通信与信息系统
中文关键词：生物信息学 ; RNA结构 ; 序列比对 ; 动态规划
英文关键词：Bioinformatics ; RNA Structure ; Sequence Alignment ; Dynamic Programming
学位年度：2008
导师：张祖平
学科代码：081001
学位授予单位：中南大学
论文提交日期：2008-05-01

摘要

生物信息学是一门综合利用生物学、计算机科学、数学等学科知识的新兴交叉学科。RNA序列比对是生物信息学研究的重要课题,特别是包含二级和三级结构的比对。由于RNA序列数据量大,折叠的结构非常复杂,造成序列的比对是一个复杂度高而又很难有实际检验的过程,其中RNA三级结构比对是NP-hard问题。如何提高序列比对的速度,以及解决三级结构比对问题是本课题研究的重点。
     本文在深入分析现有比对算法及其实现软件的基础上,利用RNA二级树形结构模型,深入分析了RNA二级结构对比对算法,对算法进行详细的阐述与分析。论文针对RNA三级结构比对难点,提出了基于二级结构映射的序列三级结构比对算法,以及基于二级结构转化的三级结构比对算法,对比对结果进行了分析。
     论文对算法的实现及其比对软件的测试结果也作了深入的分析,实验结果表明二级结构比对算法具有较好的时间特性;三级结构比对算法能够正确反映序列的相似度,并且和二级结构密切相关。在实现的过程中,还针对RNA海量数据的特点,提出内存优化和动态规划回溯优化策略,提高了处理的效率。
Bioinformatics is a new science field. Research in this field involves multi-disciplines such as biology, computer science, mathematics, etc. RNA sequence alignment is the important topics of biological information, especially the RNA that contains the secondary and tertiary structure. Because the data of RNA sequence is massive, and the structure is very complex, so that sequence alignment is a time-consuming process, which is difficult to be proofed, the tertiary structure of RNA alignment is NP-hard problem. Improve the speed of sequence alignment, as well as the tertiary structure alignment, is the focus of this research.
     After the deep analysis of existing assembly methods and implementation softwares, we analyze the optimized algorithm of RNA secondary structure alignment by the tree-model. In addition, the paper discusses the difficulties of RNA tertiary structure alignment, and gives two tertiary structure alignment algorithms which base on the mapping of the secondary structure sequence, and transforming to the secondary structure, and analyze the result.
     Finally, the paper makes a deep analysis into the realization of improved algorithms and also presents some experiments of them. The testing results indicate that the algorithm of secondary structure alignment has a good time complexity, and that of the tertiary structure can tell the sequence similarity correctly, and it is related closely with the number of secondary structure. In the process of realization, consider about the characteristics of massive data, we put forward some optimization strategy, it includes memory optimization and dynamic programming retrospect optimization and it has a high processing efficiency.

引文

[1]郝柏林.生物信息学.中国科学院院刊,2000,4:260-264http://www.bioinfo.org.cn/course/crsl.htm
    [2]张博锋.全基因组DNA测序中的片段拼接方法及其并行处理:[硕士学位论文].长沙:国防科学技术大学,2002
    [3]C.Gibas,P Jambeck.Developing Bioinformatics Computer Skill.北京:科学出版社,2002
    [4]陈润生.生物信息学—基因组研究的有力工具.2003-07-15
    [5]贺林.解码生命—人类基因组计划和后基因组计划.北京:科学出版社,2000
    [6]涂俐兰.生物序列拼接及其算法。生命科学研究,2003,7(2):79-82
    [7]E.Pennisi.The Human Genome Science,2001,291:1177-1980
    [8]张阳德.生物信息学.北京:科学出版社,2004
    [9]Lipman,Pearson.Rapid and sensitive protein similarity searches.Science,1985(227):1435-1441
    [10]S Needlman,C Wunch.A general method applicable to the search for similarities in the amino acid sequences of two proteins.Journal of Molecular Biology,1970,48:443-453
    [11]Waterman M.S..Efficient Sequence Alignment Algorithms.Theor.Biol.,1984,108-333
    [12]Waterman M.S..Sequence Alignment,Math.Methods for DNA Sequence.Editor M.S.Waterman.CRC Press,Inc..1989
    [13]Waterman M.S..Introduction to Computational Biology.Chapman and Hall,London.1995
    [14]Hirschberg D.A linear space algorithm for computing maximal common subsequences.Comm ACM,1975,18(6):341-343
    [15]Hirschberg D.Serial Computations of Levenshtein Distances.A.Apostolicl,Z.Galil(Eds.),Pattern Matching Algorithms,Oxford University Press,1997,123-141.
    [16]李昭,杨琪,祝明发.存储约束条件下的序列联配算法.微电子数与计算机,2002,19(6):1-5
    [17]Jingping Liu,Bin Ma,Kaizhong Zhang.An Algorithm for Searching RNA Motifs in Genomic Sequences.Biomolecular Engineering.doi:10.1016/j.bioeng,2007.02.005
    [18]Greory D.Collins,Shuyun Le,Kaizhong Zhang.A new algorithm for computing similarity between RNA structures.2001(139):59-77
    [19]Bin Ma,Kaizhong Zhang:On the Longest Common Rigid Subsequence Problem.CPM 2005:11-20
    [20]Bin Ma,Lieyu Wu,Kaizhong Zhang:Improving the Sensitivity and Specificity of Protein Homology Search by Incorporating Predicted Secondary Structures.International Conference on Computational Science,2005(2):960-967
    [21]Kaizhong Zhang:RNA Structure Comparison and Alignment.Data Mining in Bioinformatics 2005:59-81
    [22]Dennis Shasha,Jason Tsong-Li Wang,Huiyuan Shan,Kaizhong Zhang:A Tree Grep:Approximate Searching in Unordered Trees.SSDBM 2002:89-98
    [23]Guohui Lin,Bin Ma,Kaizhong Zhang:Edit distance between two RNA structures.RECOMB 2001:211-220
    [24]Zhuozhi Wang,Kaizhong Zhang:Multiple RNA Structure Alignment.CSB 2004:246-254
    [25]Sctubal J.,Meidanis J.计算分子生物学导论.朱浩等译.北京:科学出版社,2003
    [26]周康.关于DNA序列比对算法的简述.武汉工业学院学报.2005,24(2):99-101
    [27]Zuker M.Optimal computer folding of large RNA sequence using thermodynamics and auxiliary information.NucL.Acid Res.1981,9:133-148
    [28]郭颖,李大超.一类RNA二级结构的计数.海南师范学院学报(自然科学版).2005,3(18):16-20
    [29]P.Green.Lecture on Sequences Assembly.Washington University,1998.
    [30]沈世镒.生物序列突变于比对的结构分析.北京:科学出版社,2004
    [31]El-Mabrouk,N.,Raffinot,M..Approximate Matching of Secondary Structures.ACM,2002,156-164
    [32]郝柏林.生物信息学浅说.上海:上海科技教育出版社,2003
    [33]Le,S.,Zhang,K.,and Maizel,J.V.,Jr.,1995.A Method for Predicting Common Structures of Homologous RNAs.Computers and Biomedical Research 28,pp.53-66
    [34]V.Bafna,S.Muthukrishnan,and R.Ravi,'Comparing similarity between RNA string',Proc.Combinatorial Pattern Mathing Conf.95,LNCS937,pp.1-14,1995
    [35]PAVESI G,GIANCARLO Mauri,MARCO Stefani,et al.RNA Profile:an algorithm for finding conserved secondary structure motifs in unaligned sequences[J].Nucleic Acids Res,2004,32(10):3258-3269
    [36]GORODKIN J,HEYER L J,STORMO G D.Finding the most significant common sequence and structure motifs in a set of RNA sequences.Nucleic Acids Res,1997,25(18):3724-3732
    [37]GORDKIN J,STRICKLIN S L,STORMO G D.Discovering common stem-loop motifs in unaligned RNA sequence.Nucleic Acids Res,2001,29(10):2135-2144
    [38]CARY R B,STOMO G D.Graph-theoretic approach to RNA modeling using comparative data.Proc Int Conf Intell Syst Mol Biol.United States:AAAI Press.1995,3:75-80
    [39]TABASKA J E,CARY R B,GABOW H N,et al.An RNA folding method capcale of identifying pseudoknots and base triples.Bioinformatics,1998,14(8):691-699
    [40]Kaizhong Zhang:Efficient Parallel Algorithm for the Editing Distance between Ordered Trees.CPM 1998:80-90
    [41]Kaizhong Zhang:RNA Structure Comparison and Alignment.Data Mining in Bioinformatics 2005:59-81
    [42]Eddy,S.R.,2002.A Memory-Efficient Dynamic Programming Algorithm for Optimal Alignment of a Sequence to an RNA Secondary Structure.BMC Bioinformatics,3:18
    [43]Lafferriere,A.,Gautheret,D.,and Cedergren,R.An RNA Pattern Matching Program with Enhanced Performance and Portability.Bioinformatics,1994,Vol.10,pp.211-212
    [44]Shu-Yun Le,Jacob V.Maizel Jr.,Kaizhong Zhang:An Algorithm for Detecting Homologues of Known Structured RNAs in Genomes.CSB 2004:300-310
    [45]J.W.Brown,The ribonuclease P database,Nucleic Acids Res.27(1999)314
    [46]Eddy,S.R.,2001.Non-coding RNA Genes and the Modern RNA World.Nature Rev Genet 2:919-929
    [47]Krichevsky,A.M.,King,K.S.,Donahue,C.P.,Khrapko,K.,and Kosik,K.S.,2003.A MicroRNA Array Reveals Extensive Regulation of MicroRNA during Brian Develoment.RNA,9(10):1274-1281
    [48]欧阳钟辉,马瑾玉.海量信息关联管理.福建电脑,2007,5:1-4
    [49]方旺盛,郑剑,邵利平.一种基于文件缓冲方式的操作大数据量数据的方法.计算机技术与自动化,2004,23(4):90-92
    [50]Simons,R.W.,and Grunberg-Manago.RNA Structure and Function.Cold Spring Harbor Laboratory Press.1998,New York
    [51]Mathews D H,Turer D H.Dynalign:an algorithm for finding the secondary structure common to two RNA sequences.J Mol Biol,2002,317(2):191-203
    [52]徐鹏,王克宏.Java程序内存空间优化策略的研究.计算机科学,2002,29(4):35-38
    [53]邹玉丽,罗峰.Java开发中有关大对象的处理方法.电脑学习,2005,4(2):53-54
    [54]CYNTBIA G,PER J.Developing Bioinformatics Computer Skills.北京:科学出版社,2002
    [55]赵源,李竹,裘鸿林.基于垃圾收集的Java程序性能改善方法.计算机应用研究,2005,10:217-219
    [56]Ukkonen E.On approximate string matching.Found Comput Theory,1983,158(6):487-495
    [57]David R,Powell,Lloyd Allison,et al.A versatile divide and conquer technique for optimal string alignment 1999,70(3):127-139
    [58]唐玉荣,汪懋华.基于动态规划的快速序列比对算法.生物数学学报,2005,20(2):207-212
    [59]H.Shang,T.H.Merrttal.Tries for Approximate String Mathing.IEEE Transactions on Knowledge and Data Engineering,1996,8(8):4-7
    [60]Zuping Zhang,Cengying Fang.A Distributed Computing System Applied to Computational Biology.DCABES2007 PROCEEDINGS Volume 1:117-120

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700