Design and Implementation of a Non-coding RNA Gene Identification Model
Abstract
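Bioinformatics is a research field formed by combining computer science and the life sciences. It applies the theories and algorithms of computer science to process, store, retrieve, and analyze life-science data. As biological sequence data grow rapidly, efficient algorithms for handling these data are drawing increasing attention. Gene identification is one such focus: it means identifying, in DNA sequences, all regions that encode proteins and all non-protein-coding regions involved in regulating gene expression.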
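     This thesis studies the gene identification problem for non-coding ribonucleic acid (ncRNA). The approach uses the context-sensitive hidden Markov model (csHMM) together with species evolutionary relationships, in an attempt to find a new method for identifying non-coding RNA genes in genomes.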
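     The focus of this thesis is to build a secondary structure model of non-coding RNA from the csHMM and species evolutionary relationships, and to implement computational prediction of non-coding RNA genes. First, a basic non-coding RNA secondary structure model is built with the csHMM. Second, the emission probabilities of the csHMM are derived from an amino acid substitution matrix that represents species evolutionary relationships, yielding a new non-coding RNA identification framework, pair-csHMM. Third, the Inside-Outside algorithm of the csHMM is modified to optimize the model parameters, so that the model can extract secondary structure features from known sequences. Finally, the optimized model is used to predict non-coding RNA genes, and a prototype system is implemented.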
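     To make the csHMM idea concrete, the following is a minimal sketch, not the thesis's model. It assumes the usual description of a csHMM as having three kinds of states: single-emission states, pairwise-emission states that store each emitted symbol in an auxiliary memory, and context-sensitive states whose emissions depend on the symbol read back from that memory. The function name `sample_hairpin`, the fixed stem and loop lengths, the uniform base choice, and the `pair_prob` parameter are all illustrative assumptions; a real csHMM moves between states through transition probabilities.

```python
import random

# Watson-Crick partners; a context-sensitive state favors the partner of
# the base it pops from memory.
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}
BASES = "ACGU"


def sample_hairpin(stem_len, loop_len, pair_prob=0.9, rng=None):
    """Generate one sequence from a toy csHMM-like process (hypothetical)."""
    rng = rng or random.Random()
    stack = []  # auxiliary memory shared by the two stem states
    seq = []

    # Pairwise-emission state: emit a base and push it onto the stack.
    for _ in range(stem_len):
        base = rng.choice(BASES)
        stack.append(base)
        seq.append(base)

    # Single-emission state: ordinary HMM emission for the loop.
    for _ in range(loop_len):
        seq.append(rng.choice(BASES))

    # Context-sensitive state: pop a base; the emission depends on it.
    while stack:
        partner = COMPLEMENT[stack.pop()]
        if rng.random() < pair_prob:
            seq.append(partner)  # base-paired emission
        else:
            seq.append(rng.choice([b for b in BASES if b != partner]))

    return "".join(seq)
```

     Because the memory is a stack, the last base pushed is the first one paired, which produces the nested pairing of a stem-loop. Setting `pair_prob=1.0` makes every stem position a perfect Watson-Crick pair.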
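     The difficulty of the research lies in building a model that reflects the characteristics of non-coding RNA and in optimizing its parameters. This thesis integrates the secondary structure features of non-coding RNA and its conservation during species evolution into the model, so that the model better reflects the characteristics of non-coding RNA. The Inside-Outside algorithm of the csHMM is also modified to train the newly built model and make it more accurate. Test results show that the model reflects the characteristics of non-coding RNA reasonably well and, once optimized, can be used to identify non-coding RNA genes.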
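     The main contributions of this thesis are: (1) the context-sensitive hidden Markov model is used for non-coding RNA identification, and experiments show that the model improves the specificity of non-coding RNA gene identification; (2) species evolutionary relationships are introduced into the csHMM, and experiments show that the closer the evolutionary distance between the two aligned genomes is to the evolutionary distance of the model, the better the identification; (3) a prototype system for non-coding RNA gene identification, RNA-cs, is implemented.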
Bioinformatics is a field that combines computer science and life science. It processes, stores, searches, and analyzes data produced in life science using the theories and related algorithms of computer science. With the growth of biological sequence data, more and more attention has been paid to processing these data with efficient algorithms. Gene finding is one of these focuses: it aims to predict, from DNA sequences, both the regions that code for proteins and the regions that regulate gene expression but do not code for any protein.
     The problem of finding non-coding RNA (non-coding ribonucleic acid, ncRNA) genes is studied in this thesis. The method uses the csHMM (context-sensitive hidden Markov model) technique and species evolutionary relationships to set up a new computational framework that can distinguish non-coding RNA genes in a genome.
     The emphasis of the thesis is on using the csHMM and species evolutionary relationships to build a secondary structure model of non-coding RNA. First, a basic secondary structure model of non-coding RNA was built using the csHMM. Second, the probabilities of emitting paired residues were computed from an amino acid mutation matrix representing species evolutionary relationships, forming a new computational framework for non-coding RNA gene finding called pair-csHMM. Third, the Inside-Outside algorithm of the csHMM was modified to optimize pair-csHMM, with the aim of extracting RNA secondary structure features from known RNA sequences. Finally, a prototype system was implemented to find non-coding RNA genes.
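     The abstract does not give the derivation used for the second step. As a hedged illustration only, the sketch below uses the standard relationship between a log-odds substitution score s(a, b) and a joint probability, p(a, b) ∝ q(a) q(b) exp(λ s(a, b)), where q is a background distribution. The function name `pair_emission_probs`, the toy nucleotide scores, the uniform background, and the value of `lam` are all assumptions; the explicit renormalization stands in for solving for the exact λ of a real matrix.

```python
import math

BASES = "ACGU"


def pair_emission_probs(scores, background, lam=1.0):
    """Turn log-odds scores s(a, b) into joint probabilities p(a, b).

    Uses p(a, b) proportional to q(a) * q(b) * exp(lam * s(a, b)) and
    renormalizes so that the table sums to 1.
    """
    raw = {
        (a, b): background[a] * background[b] * math.exp(lam * scores[a, b])
        for a in BASES
        for b in BASES
    }
    total = sum(raw.values())
    return {pair: value / total for pair, value in raw.items()}


# Hypothetical placeholder scores: +1 for identical bases, -1 otherwise.
toy_scores = {(a, b): (1.0 if a == b else -1.0) for a in BASES for b in BASES}
uniform = {b: 0.25 for b in BASES}
probs = pair_emission_probs(toy_scores, uniform)
```

     A matrix built for a larger evolutionary distance has flatter scores, so the resulting table puts more probability on mismatched pairs. That is one way a model can carry an evolutionary distance of its own.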
     The main difficulties encountered in the thesis were establishing the non-coding RNA model and optimizing its parameters. Both the secondary structure conservation of non-coding RNA and its sequence conservation across evolution were integrated into the model using the csHMM. The Inside-Outside algorithm of the csHMM was also modified to train the non-coding RNA model and make it more accurate. Test results indicate that the new framework can be used to find non-coding RNA genes.
     The new ideas are summarized as follows: (1) The csHMM was used to predict non-coding RNA genes. Test results indicate that the model improves the specificity of non-coding RNA gene finding. (2) Species evolutionary relationships were introduced into the pair-csHMM model. Test results indicate that the closer the evolutionary distance between the aligned genomes is to that of the non-coding RNA model, the better non-coding RNA genes are predicted. (3) A prototype system called RNA-cs was implemented to predict non-coding RNA genes.
