RNA二级结构预测算法的研究
详细信息    本馆镜像全文|  推荐本文 |  |   获取CNKI官网全文
摘要
近年以来,越来越多的研究表明RNA在生命过程中发挥着非常重要的作用。RNA不仅是具有生物细胞结构的遗传讯息的中间载体,还具有基因表达调控、催化mRNA的剪接、加工和修饰RNA前体等其它重要功能。因此,对RNA分子的研究一直是生物信息学中的一个重要领域。而不同RNA所具有的功能与RNA的分子结构却有着密切的关系,为了更进一步的探索其更多的功能,就需要借助于RNA的二级结构。因为RNA分子自身所具有的难以结晶、降解速度快等特点,所以通过核磁共振(Nuclear magnetic resonance)或者X-射线晶体衍射和其他常规的实验方法预测RNA三维结构的费用高,耗时长。尽管通过常规的方法来确定RNA结构可以更加精确和可信,但是面对代价昂贵以及当前的海量数据,显然是满足不了需求的。所以,利用计算机实现的各种算法和数学方法来预测RNA二级结构成为公认的主要方法。
     本文对当前主流的RNA二级结构预测方法进行了较为深入地研究,包括基于热力学的方法(最小自由能方法、碱基最大配对法等);比较序列分析法(共变模型、随机上下文无关语法模型);启发式算法(遗传算法、模拟退火算法)等。通过对这些方法的研究,总结出其所各自所存在的优缺点,为本文的预测算法奠定了坚实的理论基础。
     首先,本文研究了使用最小二乘法支持向量机,从RNA序列特征入手对非编码RNA进行基因预测,相对于传统的支持向量机把解二次规划问题转化为求解线性方程组问题。在预测算法中结合主成分分析提取RNA序列的特征,对数据进行维数压缩,排除了主观因素的干扰,减少变量存储空间和计算量。通过对10种原核生物的tRNA序列的实验测试表明,本方法是一种能够有效预测原核生物ncRNA的方法。
     其次,本文研究了粒子群优化算法在RNA二级结构预测问题中的应用,提出了基于该方法来预测RNA二级结构的模型(PSOfold)。为了提高搜索最优解的能力,结合了模糊逻辑控制自适应动态地调控粒子群优化算法的参数,包括惯性权重、学习因子和粒子数量比。
     为了进一步解决PSOfold中的茎区排列问题,我们提出一种解转化策略,将离散值转换为一个有序的茎区组合。实验中选用了10条RNA序列分别从敏感性、特异性和F-measure度量与多种其他方法进行了比较。实验结果表明,这种方法是有效的并且优于其它基于进化算法和群智能的算法。
In recent years, more and more research shows that RNA plays a very importantrole in the life process. The RNA molecules are not only the carriers of geneticinformation in living cells,but also has some other important functions,such asregulating the gene expression, catalysising mRNA splice,processing and modifyingthe precursors of RNA and so on. So the research work about the RNA molecules isalways one of the important fields in the bioinformatics. The function of RNAmolecules has very close relation with their stuctures. In order to make furtherexploration, we need with the aid of RNA secondary structure.The most accuratemethod can be use by X-ray diffraction or nuclear magnetic resonance, but this isdifficult because not only it is expensive and slow but also most RNA molecules cannot be crystallized currently. Therefor, the recognized main method is by usingcomputer to realize all kinds of algorithms.
     In this paper, we study the methods of the RNA secondary structure prediction indepth. They include: the methods based on thermodynamic energy minimizationprinciple (such as Zuker’s mfold mehod, base pair maximization algorithm el.), themethod of Comparative sequence analysis (such as covariance mutation predictionmodel, stochastic context free grammar algorithm), the heuristic algorithm (geneticalgorithm, Simulated Annealing) and so on. Through the research of those methods,wesum up their respective advantages and disadvantages,and found the research idea ofthe new prediction method,which has laid a solid theoretical foundation for thecompletion of the paper work.
     Firstly, we use least squares support vector machine (LS-SVM) on the basis ofprincipal component analysis to predict tRNA. Due to the equality constraints in theformulation, a set of linear equations has to be solved instead of a quadraticprogramming problem. Compared with the traditional support vector machine (SVM),LS-SVM converted the inequality constraints into equality ones and made the trainingof the SVM equivalent to solving a group of equalities.Principal component analysis (PCA) is commonly used for feature extraction from high dimensional data. We usedthis approach to analyze the statistical features of nucleotide sequence, and then use theLS-SVM to predict the ncRNA.The results indicate that the proposed method isadoptable for prokaryotic ncRNA prediction.
     Secondly, we propose PSOfold, a particle swarm optimization for RNA secondarystructure prediction, to improve the performance of the recently published IPSO. Toenhance the searching ability of optimal solution, fuzzy logic control is applied toadaptively adjust the PSO parameters, which are inertia weight, learning factors and thenumber of ants, respectively.
     Finally, to further settle the stem permutation problem, we put forward a solutionconversion strategy to transform the discrete values of stems into an ordered stemcombination. The experimental results show that our method is effective for RNAfolding in terms of sensitivity, specificity and F-measure by comparing with othermethods based on evolutionary algorithms and swarm intelligence algorithms.
引文
[1] Jurado, J., et al. Alternative splicing of c-fos pre-mRNA: contribution of the rates ofsynthesis and degradation to the copy number of each transcript isoform anddetection of a truncated c-Fos immunoreactive species[J].BMC MolBiol,2007,8:83.
    [2] Zuker M., Mathews D.H., Turner D.H.Algorithms and thermodynamics for RNAsecondary structure prediction: A practical guide [M].In RNA Biochemistry andBiotechnology.Boston: Kluwer Academic Publishers,1999:11-43.
    [3] Gilbert W.The RNA World [J].Nature,1986(319):618.
    [4]刘琦.RNA二级结构的若干计算生物学问题的研究[D].浙江大学,2008.
    [5]张玉静.分子遗传学[M].北京:科学出版社,2000.
    [6]金由辛.核糖核酸与核糖核酸组学[M].科学出版社,2006.
    [7]刘永明.分子生物学简明教程[M].化学工业出版社.北京化学工业出版社,2006.
    [8]刘次全,白春礼,张静.结构分子生物学[M].北京:高等教育出社,1997:82-121.
    [9]郝柏林,张淑誉.生物信息学手册[M].第2版.上海:上海科学技术出版社,2002.
    [10] Mathews D H.Using an RNA secondary structure partition function to determi-neconfidence in base pairs predicted by free energy minimization [J].RNA,2004,10:1189-1190.
    [11]PleijC.W.A.Pseudoknots:a new motif in the RNA game[J].TrendsBiochen.SCI,1990,(15):143-157.
    [12]Take fuji, Y.et al.Parallel algorithms for finding a near-maximum independent setof acircle graph[J].IEEE Trans Neural Netw,1990,1(3):263-267.
    [13]刘琦.RNA二级结构的若干计算生物学问题的研究[D].浙江大学,2008.
    [14]郭颖.RNA的二级结构[D].大连理工大学,2005.
    [15]Durbin R,et al.Biological Sequence Analysis:Probabilistic Models of ProteinsandNucleic Acids[M].Cambridge UK:Cambridge University Press,1998.
    [16]Eddy.S.R.How do RNA folding algorithms work[J].Nature Biotechnology,2004,22(11):1457-1458.
    [17]韩乐,莫忠良.RNA-Z曲线及其在病毒基因识别中的应用[J].生物数学学报,2004,19:245-250.
    [18]Eddy S R.What is a hidden Markov model?[J].Nature Biotechnology,2004,22(10):1315-1316.
    [19]董浩. RNA二级结构预测方法研究.[D].吉林大学,2011.
    [20]EddyS.R. DurbinR. RNA sequence analysis using covariance models[J].Nucl.AcidsRes,1994,22:2079-2088.
    [21]Durbin R.et al.Biological Sequence Analysis: Probabilistic Models of Proteins andNucleic Acids[M].Cambridge UK: Cambridge University Press,1998.
    [22]何静媛.RNA二级结构预测算法的研究[D].重庆大学博士论文,2009.
    [23]陆健.基于动态权重匹配的RNA二级结构预测算法[D].江苏大学生物化学与分子生物学学院,2007.
    [24]P.巴尔迪,S.布鲁纳克著,张东晖等译.生物信息学—机器学习方法[M].北京:中信出版社,2003.
    [25]Rivas E,Eddy S R.The Language of RNA.A Formal Grammar That IncludesPseudoknots[J].Bioinformatics,200116(4):334-340.
    [26]Sakakibara Y,Brown M,et al.Stochastic context-free grammars for tRNAmodeling[J].Nucleotide Acids Research,1994,22(23):5112-5120.
    [27]Salvador I,Benedi J.RNA Modeling by combining stochastic context-freegrammars and n-grammodels[J].International Journal of Pattern Recognition andArtificial Intelligence,2002,16(3):309-315.
    [28]Lari K and Young S J.The estimation of stochastic context-free grammars using theinside-outside algorithm[J].Computer Speech Langues,1990,49(1):35-56.
    [29]Brow M,Wilson C.RNA pseudoknot modeling using intersections of stochasticcontext-freegrammars with applications to database search[C]. In PacificSymposium on Biocomputing,1995, pp:109-125.
    [30]Zuker M. Sankoff D. RNA secondary structures and their prediction [J].Bulletin ofMathematical Biology,1984(46):591-621.
    [31] Zuker M.Calculating nucleic acid secondary structure [J].Current opinion inStructural Biology,2000,10(3):303-310.
    [32]Zuker M.Computer Prediction of RNA structure [J]. Methods Enzymol,1989,180:262-288.
    [33]Chomsky N.On certain formal properties of grammars[J].Information Control,1959,2(2):137-167.
    [34]Akutsu Tasuya. Dynamic programming algorithms for RNA secondary structureprediction with pseudoknots [J]. Discrete applied mathematics,2000,104:45-62.
    [35]李伍举,吴加金.RNA二级结构的预测[J].军事医学科学院院刊,1996,20(4):298-310.
    [36]WuJu L.Prediction of RNA secondary structure based on he1ical regionsdistribution [J].Bioinformatics,1998,14(8):700-706.
    [37]PI Pas,j.M.and J.E.McMahon Method for Predieting RNA secondaryStructure[C].Proceedings of the National Academy of Seienees Of the UnitedStates of Ameriea,1975.72(6):p.2017-2021.
    [38]彭政.带假结的RNA二级结构预测算法研究[D].湖南:湖南大学计算机与通信学院,2008.
    [39]Tabaska J E,Cary R B,et al.An RNA Folding Method Capable of IdentifyingPseudoknots and Base Triples[J].Bioinformatics.1998,14(8):691-699.
    [40]Page R P M.Circles:Automating the Comparative Analysis of RNA SecondaryStructure[J].Bioinformatics,2000,16(11):1042-1043.
    [41]Page R D M.Comparative analysis of secondary structure of insect mitochondrialsmall subunit ribosoma RNA using maximum weighted matching[J].NucleotideAcids Research.2000,28(20):3839-3845.
    [42]纪震,廖惠连,吴青华,粒子群算法及应用[M],科学出版社,2009年,北京.
    [43]Holland J. Adaptation in Natural and Artificial Systems[M]. Ann Arbor, MI:University of Michigan Press,1975.
    [44]Goldberg D. Genetic algorithms in search, optimization, and machine learning[M].Boston: Addison-Wesley,1989.
    [45]Frenzel J F. Genetic algorithms[J]. IEEE Potentials,1993,12(3):21-24.
    [46]于丹. RNA二级结构预测算法分析与比较.[D].吉林大学.2009.
    [47]Shapiro B.A.,Wu,J.C. An annealing mutation operator in the genetic algorithm forRNAfolding[J].Comput.Appl.Biosci.,1996(12):171-180.
    [48]Metropolis N, Rosenbluth A, Rosenbluth M. Equation of state calculations by fastcomputing machines[J]. Journal of Chemical Physics,1953,21:1087-1092.
    [49]T. Lowe, S. Eddy, tRNA scan-SE: a program for improved detection of transferRNA genes in genomic sequence[J]. Nucleic Acids Res,1997,25:955-964.
    [50]D. Langenberger, C. Bermudez-Santana, P. Stadler, S. Hoffmann, Identification andClassification of Small RNAs in Transcriptome Sequence Data[C]. Pac. Symp.Biocomput,2010,15:80-87.
    [51]D. Mathews, D. Turner, Prediction of RNA secondary structure by free energyminimization[J]. Curr Opin Struc Biol,2006,16:270-278.
    [52]K. Sato, M. Hamada, K. Asai, T. Mituyama, CentroidFold: a web server for RNAsecondary structure prediction[J]. Nucleic Acids Res,2009,37:277-280.
    [53]P. Gardner, R. Giegerich, A comprehensive comparison of comparative RNAstructure prediction approaches[J]. BMC Bioinformatics,2004,5:140.
    [54]E. Rivas, S. Eddy, Noncoding RNA gene detection using comparative sequenceanalysis[J]. BMC Bioinformatics,2001,2:1.
    [55]A. Uzilov, J. Keegan, D. Mathews, Detection of non-coding RNAs on the basis ofpredicted secondary structure formation free energy change[J]. BMCBioinformatics,2006,7:173.
    [56]S. Kim, J. Nam, J. Rhee, miTarget: microRNA target gene prediction using asupport vector machine[J]. BMC Bioinformatics,2006,7:411.
    [57]S. Eddy, Computational genomics of noncoding RNA genes[J]. Cell,2001,109:137-140.
    [58]S. Kumar, M. Mandrita, Similarity-based fuzzy reasoning by DNA computing[J].Int. J. of Bio. Comput,2011,3(2):112-122
    [59]Z. Lu, K. Yip, G. Wang, C. Shou, L. Hillier, E. Khurana, A. Agarwal, R. Auerbach,J. Rozowsky, C. Cheng, M. Kato, D. Miller, F. Slack, M. Snyder, R. Waterston, V.Reinke and M. Gerstein, Prediction and characterization of noncoding RNAs in C.elegans by integrating conservation, secondary structure, and high-throughputsequencing and array data[J]. Genome Res,2011,21:276-285.
    [60]A. Marchais, M. Naville, C. Bohn, P. Bouloc and D. Gautheret, Single-passclassification of all noncoding sequences in a bacterial genome using phylogeneticprofiles[J]. Genome Res,2009,19:1084-1092.
    [61]S.-R Eddy, Computational Genomics of Noncoding RNA Genes[J]. Cell,2002,109:137-140.
    [62]S. Washietl, I. Hofacker, P. Stadler, Fast and reliable prediction of noncodingRNAs[J]. P Natl Acad Sci USA,2005,102:2454-2459.
    [63]T. Tran, F. Zhou, S. Marshburn, M. Stead, S. Kushner, Y. Xu, De novocomputational prediction of non-coding RNA genes in prokaryotic genomes[J].Bioinformatics,200925,2897-2905.
    [64]E. Torarinsson, M. Sawera, J. Havgaard, M. Fredholm, J. Gorodkin, Thousands ofcorresponding human and mouse genomic regions unalignable in primary sequencecontain common RNA structure[J]. Genome Res,200616,885-889
    [65]E. Torarinsson, Z. Yao, E. Wiklund, J. Bramsen, C. Hansen, J. Kjems, N.Tommerup, W. Ruzzo, J. Gorodkin, Comparative genomics beyond sequence-basedalignments: RNA structures in the ENCODE regions[J]. Genome Res,2008,18,242-251.
    [66]S. Washietl, J. S. Pedersen, J. Korbel, C. Stocsits, A. Gruber, J. Hackermuller, J.Hertel, M. Lindemeyer, K. Reiche, A. Tanzer, Structured RNAs in the ENCODEselected regions of the human genome[J]. Genome Res,2007,17:852-864.
    [67]V. Vapnik, The nature of statistical learning theory[M]. Springer-Verlag, New York1995.
    [68]C. Cortes, V. N. Vapnik, Support-Vector Networks[J]. Mach Learn,1995,20(3):273-297.
    [69]A. Smola, B. Scholkope, A tutorial on support vector regression[J]. Stat Comput,2004,14,1992222.
    [70]E. Osuna, R. Freund, An improved training algorithm for support vector machines,Proc[C]. of IEEE Workshop on Neural Networks and Signal Processing;(1997)September24-26,1997; Amelia Island, FL; pp.276-285.
    [71]R. Parpinelli, H. Lopes, New inspirations in swarm intelligence: a survey[J]. Int. J.of Bio. Comput,2011,3,1-16.
    [72]R. Yampolskiy, A. EL-Barkouky, Wisdom of artificial crowd’s algorithm forsolving NP-hard problems[J]. Int. J. of Bio. Comput,2011,3:358-369.
    [73]J. Suykens, J. Vandewalle, Least square support vector machine classifier[J].Neural Process Lett,1999,9:293-300.
    [74]J. Suykens, L. Lukas, Sparse approximation using least squares support vectormachines[J]. Proc. of the IEEE International Symposium on Circuits and Systems;(2000) May28-31,2000; Geneva, Switzerland; pp.757-760.
    [75]I. Jolliffe, Principal Component Analysis[M]. Springer-Verlag, New York,1986.
    [76]L. Zou, Z. Wang, Prediction of Non-coding RNA based on Neural Network withPrincipal Component Analysis[J]. J. Biomed Eng Res,2007,26:6-9.
    [77]Q. Ruan, Y. Wang, PCA approach to BP learning[J]. J. Fudan U (Nat Sci).2005,44(2):318-322.
    [78]Information on http://noncode.bioinfo.org.cn
    [79]Information on http://lowelab.ucsc.edu/GtRNAdb/
    [80] M. Zuker and P. Stiegler, Optimal Computer Folding of Large RNA SequencesUsing Thermodynamics and Auxiliary Information[J]. Nucleic Acids Research,1981,9:133-148.
    [81]Tinoco Jr. and C. Bustamante, How RNA Folds[J]. J. Molecular Biology,1999,293(1):271-281.
    [82]Tinoco, O.C. Uhlenbeck, and M.D. Levine, Estimation of Secondary Structure inRibonucleic Acids[J]. Nature,1971,230:267-362.
    [83]R. Nussinov, G. Pieczenik, J.R. Griggs, and D.J. Kleitman,Algorithms for LoopMatchings[J]. SIAM J. Applied Math.,1978,35:68-82.
    [84]M. Zuker, On Finding All Suboptimal Foldings of an RNA Molecule[J]. Science,1989,244:48-52.
    [85]M. Zuker, Prediction of RNA Secondary Structure by Energy Minimization[M].Computer Analysis of Sequence Data, A.M. Griffin and H.G. Griffin, eds.,Humana Press, July1994, pp.267-294.
    [86]J.A. Jaeger, D.H. Turner, and M. Zuker, Improved Predictions of SecondaryStructures for RNA[J]. Biochemistry, Oct.1989,86:7706-7710.
    [87]M. Zuker, J.A. Jaeger, and D.H. Turner, A Comparison of Optimal and SuboptimalRNA Secondary Structures Predicted by Free Energy Minimization with StructuresDetermined by Phylogenetic Comparison[J]. Nucleic Acids Research,1991,19(10):2707-2714.
    [88]R.B. Lyngs and C.N.S. Pedersen, Pseudoknots in RNA Secondary Structures[C],Proc. Fourth Ann. Int’l Conf. Computational Molecular Biology (RECOMB’00),2000, pp.201-209.
    [89] S.R. Eddy, How Do RNA Folding Algorithms Works?[J]. Nature Biotechnology,2004,22(11):1457-1458.
    [90]J. Reeder, M. Ho¨schsmann, M. Rehmsmeier, B. Vo, and R. Giegerich, BeyondMfold: Recent Advances in RNA Bioinformatics[J], J. Biotechnology,2006,124(1):41-55.
    [91]B.A. Shapiro and J. Navetta, A Massively-Parallel GeneticAlgorithm for RNASecondary Structure Prediction[J]. J. Supercomputing,1994,8:195-207.
    [92]J.H. Chen, S.Y. Le, and J.V. Maizel, Prediction of Common Secondary Structuresof RNAs: A Genetic Algorithm Approach[J],Nucleic Acids Research,2000,28:991-999.
    [93]A.P. Gultyaev, F.H.D. van Batenburg, and C.W.A. Pleij, The Computer-Simulationof RNA Folding Pathways Using a Genetic Algorithm[J]. J. Molecular Biology,1995,250:37-51.
    [94]K.C. Wiese, Al.A. Deschenes, and A.G. Hendriks,Rnapredict-an evolutionaryalgorithm for RNA secondary structure prediction[J]. IEEE/ACM Transactions onComputational Biology and Bioinformatics,2008,5:25-41.
    [95]K.C. Wiese and A. Hendriks, Comparison of P-RnaPredic and mfold—Algorithmsfor RNA Secondary Structure Prediction[J]. Bioinformatics,2006,22(8):934-942.
    [96]H.H. Tsang and K.C. Wiese, SARNA-Predict: Accuracy Improvement of RNASecondary Structure Prediction Using Permutation-Based Simulated Annealing,IEEE/ACM Transactions on Computational Biology and Bioinformatics,2010,7(4).
    [97]H.H. Tsang and K.C. Wiese, SARNA-Predict: A Simulated Annealing Algorithmfor RNA Secondary Structure Prediction[C]. Proc. IEEE Symp. ComputationalIntelligence in Bioinformatics and Computational Biology (CIBCB’06),2006, pp.466-475.
    [98]Y.N. Liu, H. Dong, H. Zhang, G. Wang, Z. Li, H.L. Chen, Prediction of RNASecondary Structure Based on Particle Swarm Optimization[J]. Chem. Res.Chinese Universities,2011,27(1):108-112.
    [99]J. Yu, C.H. Zhang, Y.N. Liu, X. Li, Simulating the Folding Pathway of RNASecondary Structure Using the Modified Ant Colony Algorithm[J]. Journal ofBionic Engineering,2011, vol.7:382–389.
    [100]Hao Wu, Yan-feng Shi, Xing Jin, Gang Wang, Hao Dong. A Fuzzy AdaptiveParticle Swarm Optimization for RNA Secondary Structure Prediction [C].International Conference on Information Science and Technology, March26-28,Nanjing, Jiangsu, China,2011, pp:1390-1393.
    [101]Chong Xing, Gang Wang, Yao Wang,Zhaohua Ji, Wei Shen, Yanchun Liang.Anovel method for RNA secondary structure prediction[C]. Proc. the7thInternational Conference on Natural Computation (ICNC'11),2011, pp:1162-1166.
    [102]纪震,廖惠连,吴青华,粒子群算法及应用[M],科学出版社,北京,2009.
    [103]Reynolds. Flocks, herds, and schools: A distributed behavioral model[J].Computer Graphic,1987,21(4):25-34.
    [104]Kennedy J, Eberhart R.Particle swarm optimization[C].Proceeding of IEEEInternational Conference on Neural Networks, Psicataway: IEEE Service Center,1995:1942-1948.
    [105]李宁,付国江,库少平,等.粒子群优化算法的发展与展望[J].武汉理工大学学报信息与管理工学版,2005,27(2):26-29.
    [106]Boyd R, Recharson P. Culture and the evolution process [M]. Chicago: Universityof Chicago Press,1985.
    [107]Y. Shi and R. C. Eberhart, A modified particle swarm optimizer[C]. Proceedingsof the1998IEEE International Conference on Evolutionary Computation,1998,pp.69-73.
    [108]R. C. Eberhart and J. Kennedy, A new optimizer using particle swarm theory [C],Proceedings of the Sixth International Symposium on Micro Machine and HumanScience,1995, pp.39-43.
    [100]J. Kennedy and R. C. Eberhart, Particle swarm optimization [C], IEEEInternational Conference on Neural Networks,1995, vol.4, pp.1942-1948.
    [110]Kennedy J, Eberhart R. A discrete binary version of the particle swarmalgorithm[C]. IEEE International Conference on Systems, Man, and Cybernetics,1997, pp.4104-4108.
    [111]Pearson, K. On Lines and Planes of Closest Fit to Systems of Points in Space[J].Philosophical Magazine,2(6):559–572.
    [112]Hao Dong, Yuanning Liu, Hao Zhang, Gang Wang. A Method of RNA SecondaryStructure Prediction Based on Hidden Markov Model [J]. Journal of ComputerResearch and Development,2012,49(4):812-817.

© 2004-2018 中国地质图书馆版权所有 京ICP备05064691号 京公网安备11010802017129号

地址:北京市海淀区学院路29号 邮编:100083

电话:办公室:(+86 10)66554848;文献借阅、咨询服务、科技查新:66554700