RNA二级结构预测算法的研究
详细信息    本馆镜像全文|  推荐本文 |  |   获取CNKI官网全文
摘要
RNA(Ribonucleic Acid,RNA)分子在生物细胞中不仅充当着遗传信息的载体和传递工具,还具有催化RNA的剪接,加工和修饰RNA前体,调控基因表达和生物体的生长发育等一系列重要的功能,而功能与结构是密切相关的,因此对RNA分子结构的研究就成为分子生物学的一个重要领域。由于RNA分子具有降解速度快,难以结晶等特点,通过X射线晶体衍射和核磁共振等实验方法去测定RNA分子的立体结构花费的成本高、时间长,虽然测得的结果精确可靠,可是面对当前海量的生物序列,实验方法显然跟不上要求,因此RNA二级结构预测就成为研究RNA分子结构的主要手段。RNA二级结构预测是指借助于计算机手段和各种数学方法从理论上去预测RNA的空间结构,可为揭示RNA结构与功能的关系提供重要信息,大大提高认识RNA空间结构的效率。
     论文对目前主流的RNA二级结构预测算法的理论和实现方法进行了细致的研究。通过对基于热力学的预测方法(包括Zuker的最小自由能算法、遗传模拟退火算法、Hopfield神经网络方法、免疫粒子群算法)和比较序列分析方法(协同变异预测模型、随机上下文无关语法预测)以及基于机器学习的分类预测方法的分析,对这些算法存在的优缺点进行了比较研究,总结出了RNA结构预测方法发展的趋势和要求,为本文的预测算法奠定了理论和实验基础。
     首先论文分析了人工鱼群智能算法在优化问题中的优势和不足,并针对基本人工鱼群算法在解决离散问题的过程中存在的的缺陷进行了相应改进,首次将鱼群算法应用到RNA二级结构预测问题中,建立了一种基于人工鱼群算法的最小自由能算法模型。在对算法编码实现时,采用集合表示状态点,能有效地缩小搜索空间,有利于算法在较短时间内找到目标解。仿真实验与传统的基于最小自由能的相关算法进行了比较研究,结果表明,使用改进鱼群算法进行RNA序列的二级结构预测能获得较理想的预测效果,能有效减少计算量、节省计算时间,特别当待测序列长度大于500时,鱼群算法在收敛速度上有着较明显优势。
     其次,研究了粒子群优化算法在组合优化问题中的应用背景,针对基本粒子群算法的早熟收敛,容易陷入局部最优且搜索精度不高等缺点,进行了相应的改进,提出了局部精英粒子群算法,在该算法中,通过改变粒子的邻居拓扑结构,使每个粒子拥有固定的局部邻居,每次迭代都会根据自身在邻居中的地位和状态以及历史最优值来调整下一步的状态。由于有效地保持粒子的多样性,使得算法有较好地跳出局部极值的特性。
     本文根据局部精英粒子群算法的思想构建了一套基于最小自由能思想的RNA二级结构预测模型。在对算法进行编码时,使用集合来表示粒子的状态,巧妙地将粒子运动的速度和状态函数使用集合之间的运算来重载,避免了传统粒子群算法参数选择的烦恼。实验数据有力地支持了改进后的粒子群算法和新的粒子运动状态编码方式。
     第三,通过扩展NSSEL(New Secondary Structure Element Labels,NSSEL)标签,创建了一套能够描述伪结结构信息的eNSSEL(extended NSSEL,eNSSEL)标签。一条RNA分子序列中的所有碱基都可以使用eNSSEL标签进行标记,从另一个角度来理解,即:任意一个碱基都可以被分类为某一个标签,因此,一条原始的RNA分子序列能与一条eNSSEL标签序列一一对应。由于eNSSEL标签携带了结构信息,因此,对于某一个RNA分子而言,只要得到其对应的标签序列,就可以知道其二级结构的组成。根据该思想,建立了基于SVMs(support vector machines,SVMs)的分类预测模型。该模型通过有效训练后,在可接受的预测精度范围里具有较低的计算复杂度,能有效地解决传统算法中存在的计算复杂性问题,为预测长链分子提供了一种全新的思路。
RNA can act as a carrier of genetic information, a catalyst of biochemical reactions, an adapter molecule in protein synthesis, and a structural molecule in cellular organelles.The function of a RNA molecule has a close relation with its secondary structure. So the research work about RNA secondary structure has become more and more important.The Most accurate structure can be determined by X-ray diffraction or nuclear magnetic resonance,but this is difficult because not only it is expensive and slow but also most RNAs can not be crystallized currently.Obviously it is necessary to study RNA secondary structure by means of some prediction algorithms based on computation theory,these prediction algorithms can raise efficiency greatly for scientists to know RNA secondary structure.
     Firstly,some major algorithms and theories for RNA secondary structure prediction are introduced in this dissertation,they include the methods based on thermodynamic energy minimization principle(such as Zuker’s mfold mehod,Genetic simulated annealing algorithm,Hopfield network method,and Immune particle swarm algorithm), phylogenetic comparative methods(covariance mutation prediction model, stochastic context free grammar algorithm),and classification method with BP neural network. The author analyzes the advantage and limitation of these algorithms by comparing them,and then gives a summary of development demand and trendency for RNA secondary structure prediction.
     Nextly,a modified artificial fish swarm optimization algorithm is presented in this paper,which can self-adaptively change some parameter’s values , improve the behavior of fish and reduce the search space,therefore it can rapidly close to the global best result by avoiding the blindness of searching at the later stage owing to the improvement on basic fish swarm algorithm.At the same time,a RNA secondary structure prediction model based on modified fish swarm algorithm is presented,state space that is described with sets in our model largely shrinks the search space,so,when compared with SetPSO(Particle Swarm optimization with Set,SetPSO) method and GSA(Genetic Simulated Annealing,GSA)method,our algorithm not only is effective but also has much lower runtime cost.
     Thirdly, the PSO(Partcle Swarm Optimization,PSO)algorithm is analysed in detail.There are some defects for its application in solving combinatorial optimization problems,for example,the PSO algorithm is prone to stay in local optimal results instead of the global optimal objective answer due to it’s rapid convergence.Aiming at this drawback,an improved algorithm called LEPSO(Local Elite Particle Swarm Optimization,LEPSO) algorithm is proposed.In the new algorithm,every particle has fixed neighbours, and adjusts its state for next step according to its past optimal value and the local impact factor of its neighbours in every iteration. Because of keeping diversiform directions of movement for each particle,the new algorithm can easily find global best result. The author realizes a RNA secondary structure prediction model with LEPSO algorithm,some new operators are defined to match the velocity and position formula,they skillfully avoid assigning values for all parameters,so that make the algorithm easily understandable.
     Lastly,the author sets up a set of labels which is called as eNSSEL(extended New Secondary Structure Element Label,eNSSEL) labels,these labels not only can express the simple stem-loop motif in a RNA secondary structure,but also can describe pseudoknots structure.Each base in a RNA molecule can be marked by a kind of label,so a RNA sequence can be converted to a label sequence. Corresponsively,a label sequence can be reverted to RNA secondary structures with a simple method.If only a label sequence for a RNA molecule has been known, the secondary structure for the molecule can be acquired by reversion.When a label is regarded as a structure type associated with a base in RNA molecule,the problem for RNA secondary structure prediction can be considered a classification problem. Consequently a SVMs(Support Vector Machines,SVMs) model for RNA secondary structure prediction is created in terms of classification theory.Experiment shows that this model not only can solve high runtime complexity problem of traditional algorithms, but also can efficiently predict long RNA sequences which are difficult with traditional folding algorithms.it gives a new thought of RNA secondary structure prediction for long sequence with pseudoknots.
引文
[1] T K ATTWOOD.生物信息学概论[M].北京:北京大学出版社,2002(4).
    [2]李衍达,孙之荣等译.生物信息学——基因和蛋白质分析的实用指南[M].北京:清华大学出版社,2000.
    [3]郝柏林,张淑誉.生物信息学手册[M].第2版.上海:上海科学技术出版社,2002.
    [4]赵国屏等.生物信息学[M].北京:科学出版社,2002.
    [5] ZUKER M.,STIEGLER P.Optimal computer folding of large RNA sequences using thermodynam ics and auxiliary information[J].Nucleic Acids Res,1981,9 (1): 133-148.
    [6] ZUKER M.Prediction of RNA secondary structure by energy minimization[J]. Methods Mol Biol,1994,25:267-294.
    [7] Zuker M.,Mathews D.H.,Turner D.H.Algorithms and thermodynamics for RNA secondary structure prediction:A practical guide[M].In RNA Biochemistry and Biotechnology. Boston: Kluwer Academic Publishers, 1999: 11-43.
    [8] Zuker M. Computer prediction of RNA structure.Methods in Enzymology[J]. 1989, 180: 262-288.
    [9] Zuker M.Calculating nucleic acid secondary structure[J].Curr.Opin.Struct.Biol., 2000, 10:303-310.
    [10] NUSSINOV R.,JACOBSON A.B.Fast algorithm for predicting the secondary structure of single-stranded RNA[C].Proc Natl Acad Sci USA,1980,77(11):6309-6313.
    [11] MATHEWS D.H.,TURNER D.H.,Dynalign.An algorithm for finding the secondary structure common to two RNA sequences[J].J Mol Biol.,2002,317(2):19-203.
    [12] PARSCH J.,BRAVERMAN J.M,STEPHAN W.Comparative sequence analysis and patterns of covariation in RNA secondary structures[J].Genetics,2000(154):909-921.
    [13] Durbin R.,et al.Biological Sequence Analysis Probabilistic Models of Proteins and Nucleic Acids[M].Cambridge UK: Cambridge University Press, 1998.
    [14] Knudsen B.,Hein J.RNA Secondary Structure Prediction Using Stochastic Context-free Grammars and Evolutionary History[J]. Bioinformatics,1999,15:446-454.
    [15] Knudsen B., Hein J.Pfold:RNA secondary structure prediction using stochastic context-free grammars[J]. Nucl. Acids Res.,2003(31):3423-3428.
    [16] Brow M.P.Small subunit ribosomal RNA modeling using stochastic context-free grammars[C].Proc.Int. Conf. Intel. Syst. Mol. Biol.,2000(56):57-66.
    [17] Holmes I.Rubin D.H.Pairwise RNA structure comparison with stochastic context-freegrammars[C].In Pacific Symposium on Biocomputing,2002:191–203.
    [18] Eddy S.R.,Durbin R.RNA Sequence Analysis Using Covariance Models[J].Nucl. Acids Res.,1994(22):2079-2088.
    [19] Sakakibara Y.,Brown M., et al.Stochastic context-free grammars for modeling RNA[J]. Technical Report: UCSC-CRL-93-16,1993.
    [20] Rivas E.,Eddy S.R.The Language of RNA:A Formal Grammar That Includes Pseudoknots[J].Bioinformatics,2000(16):334-340.
    [21] Sakakibara Y.,Brown M.,et al.Stochastic context-free grammars for tRNA modeling[J].Nucl. Acids Res.,1994(22):5112-5120.
    [22] Rivas E.,Eddy S.R.A Dynamic Programming Algorithm or RNA Structure Prediction Including Pseudoknots[J].J.Mol.Biol.,1999(285):2053-2068.
    [23] DIRKS R M.,PIERCE N A.A partition function algorithm for nucleic acid secondary structure including pseudoknots[J].J Comput Chem,2003,24(13):1664-1677.
    [24] CAI L.,MALMBERG R L.,WU Y..Stochastic modeling of RNA pseudoknotted structures: a grammatical approach[J].Bioinformatics,2003,19(Suppl 1):166-173.
    [25] CARY R B.,STORMO G D.Graph-theoretic approach to RNA modeling using comparative data[C].Proc Int Conf Intell Syst Mol Biol,United States:AAAI Press,1995(3):75-80.
    [26] TABASKA J E.,CARY R B.,GABOW H N.,et al.An RNA folding method capable of identifying pseudoknots and base triples[J].Bioinformatics,1998,14(8):691-699.
    [27] Gilbert W. The RNA World[J]. Nature,1986(319):618.
    [28]于自然,黄熙泰.现代生物化学[M].北京:化学工业出版社, 2001.
    [29]张玉静.分子遗传学[M].北京:科学出版社,2000.
    [30]转运RNA[Z].http://baike.bbioo.com/doc-view-10.htm.
    [31]龚子端,韩洪波,非编码RNA的进展[J].攀枝花学院学报,Vol.23,No.6,2006.
    [32]李恒武,朱大铭,纪秀花.RNA二级结构预测算法的设计与实现[J].计算机工程与科学,Vol.28,No.7,2006.
    [33]宁正元,林世强.RNA二级结构预测方法[J].福建农林大学学报(自然科学版), Vol.36,No.1,2007.
    [34] Nussinov R.,Pieczenik G.,et al.Algorithms for Loop Matchings[J].SIAM J.Appl. Math., 1978(35):68-82.
    [35] Zuker M., Sankoff D. RNA secondary structures and their prediction[J].Bulletin of Mathematical Biology,1984(46):591-621.
    [36] Lyngso R.B., Zuker M.,et al.Fast Evaluation of Internal Loops in RNA Secondary Structure Prediction[J].Bioinformatics,1999(15):440-445.
    [37] Mout D.W.Bioinformatics:Sequence and Genome Analysis[C].New York:Cold Spring Harbor Laboratory Press, 2001.
    [38] Pleij C.W.A.Pseudoknots:a new motif in the RNA game[J].Trends Biochem. Sci., 1990(15):143-157.
    [39] Van Batenburg F.H.D,Gultyaev A.P., Pleij C.W.A.An APL-programmed Genetic Algorithm for the Prediction of RNA Secondary Structure [J ].J Theo Biol,1995, 174(3):269-280.
    [40] Schmitz M.,Steger G.Description of RNA folding by“Simulated Annealing”[J ]. J. Mol Biol.,1996,255(1):254-266.
    [41] Shapiro B.A.,Wu,J.C.An annealing mutation operator in the genetic algorithm for RNA folding[J].Comput. Appl. Biosci.,1996(12):171-180.
    [42] Kim J.,Cole J.R.,Prmanik S.Alignment of possible secondary structures in multiple RNA sequences using smulated annealing[J].CABIOS,1996(12):259-267.
    [43] Shapiro B.A.,Navetta J.A massively parallel genetic algorithm for RNA secondary structure prediction[J].J.Supercomput.,1994(8):195-207.
    [44] Shapiro B.A.,Wu J.C.,et al.The Massively Parallel Genetic Algorithm for RNA Folding: MIMD Implementation and Population Variation[J].Bioinformatics,2001(7):137-148.
    [45] Taneda A.Cofolga:a genetic algorithm for finding the common folding of two RNAs[J].Computational Biology and Chemistry,2005(29):111-119.
    [46] Chen J.H.,Le S.Y.,et al.Prediction of Common Secondary Structures of RNAs: A Genetic Algorithm Approach[J].Nucl. Acids Res.,2000(28):991-999.
    [47] Notredame C.,O’Brien E.A.,et al.RAGA:RNA Sequence Alignment by Genetic Algorithm[J].Nucl.Acids Res.,1997(25):4570-4580.
    [48] Shapiro B.A.,Bengali D.,et al.RNA folding pathway functional intermediates: their prediction and analysis[J].J. Mol.Biol.,2001(312):27-44.
    [49]叶世伟,史忠植译,神经网络原理[M].北京:机械工业出版社,2004.
    [50]刘琦,张引,叶修梓等.基于离散Hopfield网络求解极大独立集的茎区选择算法以及在RNA二级结构预测中的应用[J].计算机学报,Vol.31,No.1,2008.
    [51]李有梅,徐宗本,苗夺谦.用神经网络启发式算法求解最大独立集问题[J].模式识别与人工智能,2003,16(1):76-80).
    [52] Kennedy J.,Eberhart R.C.Particle swarm optimization[C],Proc IEEE Int’Conf on Neural Networks IV,Piscataway,NJ:IEEE Service Center,1995:1942-1948.
    [53] Eberhart R.C.,Shi T.C.Comparison between Genetic Algorithm and Particle swarm optimization[C],Lecture Notes in Computer Science 1447: Evolutionary Programming VII[S.I.]:Springer 1998:611-616.
    [54] Ho S.L,Yang S.,Ni G.,et al.A Particle swarm optimization-based method for multiobjective design optimization[J],IEEE Trans Magnetics,2005,41(5):1756-1759.
    [55] Ratnaweera A.,Halgamuge S.K.,Watson H.C.Self-organizing hierarchical particle swarm optimizer with time-varying acceleration coefficients[J].IEEE Trans Evol Comput, 2004,8(3):240-255.
    [56] Michael Geis,Martin Middendorf, A Particle Swarm Optimizer for Finding Minimum Free Energy RNA Secondary Structure[C].Proceedings of the 2007 IEEE Swarm Intelligence Symposium.
    [57]胡桂武,彭宏.基于免疫粒子群集成的RNA二级结构预测算法[J].计算机工程与应用,Vol.43,No.3,2007.
    [58]郭颖.RNA的二级结构[学位论文].大连理工大学,2005.3.
    [59] Wuchty S.,Fontana W.,Hofacker I.L.,Schuster P.Complete suboptimal folding of RNA and the stability of secondary structures[J].Biopolymers,1999(49):145-165.
    [60]张秀苇,邓志东,宋丹丹.RNA二级结构预测的神经网络方法[J].清华大学学报(自然科学版),Vol.46,No.10,2006.
    [61] Fogel G.B.,Porto V.W.,et al.Discovery of RNA structural elements using evolutionary computation[J].Nucl. Acids Res.,2002(30):5310-5317.
    [62] Knudsen B.,Andersen E.S.,et al.Evolutionary rate variation and RNA secondary structure prediction[J].Computational Biology and Chemistry,2004(28):219-226.
    [63] Nakaya A.,Yamamoto K.,Yonesawa A.RNA secondary structure prediction using highly parallel computers[J].Comput. Applic. Biosci.,1995(11):685-692.
    [64] Fekete M.,Hofacker I.L.,Stadler P.F.Prediction of RNA Base Pairing Probabilities on Massively Parallel Computers[J].Jounal of Comput. Biology,2000(7):171-182.
    [65]胡中功,李静.群智能算法的研究进展[J].自动化技术与应用,Vol.27,No.2,2008.
    [66]李晓磊,钱积新.人工鱼群算法:自上而下的寻优模式[C].过程系统工程年会论文集,2001:76-82.
    [67]王文杰,叶世伟.人工智能原理与应用[M].北京:人民邮电出版社,2004.
    [68] HACKWOOD S.,BENI G.Self—organization of sensors for Swarm ntelligence[C]. IN:IEEE International conference on Robotics and utomation.Piscataway,NJ:IEEE Press,1992:819-829.
    [69]段海滨,王道波,于秀芬.几种新型仿生优化算法的比较研究[J].计算机仿真,Vol. 24,No.3,2007.
    [70]李晓磊.一种新型的智能优化方法-人工鱼群算法[学位论文].浙江大学,2003.
    [71]黄华娟,周永权.改进型人工鱼群算法及复杂函数全局优化方法[J].广西师范大学学报(自然科学版),Vol.26,No.1,2008.
    [72] Van Bantenburg F.H.D.,Gultyaev A.P.,et al.An APL-programmed Genetic Algorithm for the Prediction of RNA Secondary Structure[J].J.theor.Biol.,1995(174): 269-280.
    [73] Benedetti G.,Morosetti S.A genetic algorithm to search for optimal and suboptimal RNA secondary structures[J].Biophys. Chem.,1995(55):253-259.
    [74]邹权,郭茂祖,张涛涛.RNA二级结构预测方法综述[J].电子学报,Vol.36,No.2,2008.
    [75] Neethling M.,Engelbrecht A.P.Determining rna secondary structure using set-based particle swarm optimization[C].IEEE Congress on Evolutionary Computation (CEC2006), 2006:1670-1677.
    [76]刘海军.RNA二级结构预测的建模及其应用研究[学位论文].上海大学,2005.
    [77]曾建潮,介倩,崔志华.微粒群算法[M].北京:科学出版社,2004.
    [78] Heppner F.,U. Grenander. A stochastic nonlinear model for coordinated bird flocks[M]. In S. Krasner, Ed., the Ubiquity of Chaos. AAAS Publications,Washington.DC,1990.
    [79] J.Kennedy, R.Eberhart. Particle Swarm Optimization [C]. In: Proc IEEE Int Conf on Neural Networks, 1995:1942-1948.
    [80] Shi Y.,Eberhart R. A Modified Particle Swarm Optimizer[C]. In: Proceedings of the IEEE International Conference on Evolutionary Computation. Piscataway,NJ:IEEE Press, 1998:69-73.
    [81] Coleman T.F,Y. Li.An Interior Trust Region Approach for Nonlinear Minimization Subject to Bounds[J].SIAM Journal on Optimization,1996(6):418-445.
    [82] Coleman T.F,Y Li.On the Convergence of Reflective Newton Methods for Large-Scale Nonlinear Minimization Subject to Bounds[J]. Mathematical Programming, 1994, 67(2):189-224.
    [83]樊玮,粒子群优化方法及其实现[J].航空计算技术,2004(3):39-40.
    [84] Shi Y.,Eberhart R.C.Fuzzy adaptive particle swarm optimization[C].Proceedings of the IEEE Congress on Evolutionary Computation.Seoul,Korea,2001,101-106.
    [85]张建科.几类改进的粒子群算法[学位论文].西安电子科技大学,2006.
    [86]刘晶晶.粒子群优化算法的改进与应用[学位论文].武汉理工大学,2007.
    [87] V. Cherkassky, F. Mulier.Learning from Data:Concepts[M]. Theory and Methods,New York,John Wiley & Sons,1997.
    [88] Vapnik V.N.The Nature of Statistical Learning Theory[M]. NewYork: Springer-Verlag, 1995.(张学工译.统计学习理论的本质[M].北京:清华大学出版社,1999.
    [89] Vapnik V.N. Statistical Learning Theory [M]. J.Wiley,New York,1998.(许建华,张学工译.统计学习理论[M].北京:电子工业出版社,2004.)
    [90] Vapnik V.,Levin E.,Le Cun Y.Measuring the VC-dimension of a learning machine[J].NeuralComputation,1994(6):851-876.
    [91]张学工.关于统计学习理论与支持向量机[J].自动化学报,2000(1):32-42.
    [92] Chih-Wei Hsu,Chih-Jen Lin.A Comparison of Methods for Multiclass Support Vector Machines[C]. IEEE Transactions on Neural Networks,2002,13(2):415-425.
    [93] C. J. C. Burges, A. J. Smola,Editors.Advances in Kernel Methods: Support Vector Learning[M]. The MIT Press,Cambridge,MA,1999,255– 268.
    [94] Dietterich,T. G. Bakiri. Solving multiclass learning problems via error correcting output codes [J]. Journal of Artificial Intelligence Research,1995(2):263-286.
    [95] Cecilio Angulo,Xavier Parra,Andreu Catala.K-SVCR A support vector machine for multi-class classification[J]. Neuro computing,2003 (55):57 -77.
    [96]王睿.支持向量机中参数选择方法分析.重庆师范大学学报(自然科学版),Vol.24, No.2,2007.
    [97] Tuerk C.,MacDougal S.,et al.RNA pseudoknots that inhibit human immunodeficiency virus type 1 reverse transcriptase[C], Proc Natl Acad. Sci. USA, 1992(89): 6988-6992.
    [98] Van Belkum A., Abrahams J.,et al.Five pseudoknots are present at the 204 nucleotides long 3’noncoding region of tobacco mosaic virus RNA[J].Nucleic Acids Res., 1985(13): 7673-7686.
    [99] Ferre-D Amare A., Zhou K.,et al.Crystal structure of a hepatitis delta virus ribozyme[J].Nature,1998(395):567-574.
    [100] Rietveld K., Poelgeest R.V., et al. The tRNA-like structure at the 3’terminus of turnip yellow mosaic virus RNA: differences and similarities with canonical tRNA[J]. Nucleic Acids Res., 1982(10):1929-1946.