Research on a New Visual Representation of RNA Secondary Structure and Its Applications
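Abstract
The comparison and analysis of biological sequences is one of the current research hotspots in bioinformatics. Biological sequences generally refer to DNA, RNA, or protein sequences. As research has progressed, RNA, as a major genetic material, has gradually become a research focus. Because RNA secondary structure is more conserved than the primary sequence, and because rich information usable for classification and phylogenetic analysis has been found within it, analyzing RNA secondary structure is of great significance and value. This thesis studies the similarity between RNA secondary structures. It presents two similarity-analysis methods, one based on a new visual representation and one based on Lempel-Ziv complexity, which provide new approaches to the visual representation and analysis of biological sequences.
     This thesis completes the following two pieces of work:
     (1) A new visual representation of RNA secondary structure, the CZ curve, is proposed, and two properties of the CZ curve are given. Based on the CZ curve, coordinate projection maps of the points corresponding to RNA secondary structures are drawn, and part of the similarity information and the base composition of the characteristic sequences are read directly from these maps. The CZ curve is then applied to the similarity analysis of RNA secondary structures, and the comparison results are reported. From the resulting similarity matrix, an agglomerative hierarchical clustering algorithm is used to build an evolutionary tree for 11 real RNA secondary structures. The experimental results show that the proposed method can effectively analyze the similarity of RNA secondary structures (including pseudoknots) and can also group RNA secondary structures of different kinds correctly. In addition, the method only needs the geometric center of the characteristic curve corresponding to each characteristic sequence to compute the similarity matrix, so its computational complexity is low.
     (2) To address the problem that different RNA secondary structures may currently correspond to the same characteristic sequence, a new representation of the characteristic sequence of an RNA secondary structure is proposed, together with the rules to follow during conversion. The Lempel-Ziv algorithm is then used for similarity analysis between the new characteristic sequences, with two groups of the data used in Chapter 3 selected as experimental data. The experimental results agree with the analyses in the related literature, showing that this representation effectively extracts the structural information of RNA secondary structures and avoids the problem that different RNA secondary structures may correspond to the same characteristic sequence.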
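The center-based comparison described in item (1) can be illustrated with a minimal sketch. The abstract does not give the CZ curve coordinates, the distance between centers, or the linkage rule, so everything concrete here is an assumption: the toy 3D point lists stand in for characteristic curves, Euclidean distance is used between centers, and average linkage is used for the agglomerative step. The function names are hypothetical.

```python
import math

def centroid(points):
    """Geometric center (coordinate-wise mean) of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

def distance_matrix(curves):
    """Pairwise Euclidean distances between curve centers (assumed metric)."""
    centers = [centroid(points) for points in curves]
    return [[math.dist(a, b) for b in centers] for a in centers]

def average_linkage(dist):
    """Agglomerative clustering with average linkage (assumed linkage rule).

    Returns the merge order as a list of (cluster_a, cluster_b, distance),
    where each cluster is a tuple of original item indices.
    """
    clusters = {i: [i] for i in range(len(dist))}
    merges = []
    while len(clusters) > 1:
        best = None
        keys = sorted(clusters)
        for x in range(len(keys)):
            for y in range(x + 1, len(keys)):
                a, b = keys[x], keys[y]
                # Mean of all pairwise distances between the two clusters.
                total = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
                d = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((tuple(clusters[a]), tuple(clusters[b]), d))
        clusters[a] = clusters[a] + clusters.pop(b)
    return merges

# Toy point lists (hypothetical data, not real CZ curve coordinates).
toy_curves = [
    [(0, 0, 0), (2, 2, 2)],
    [(0, 0, 0), (2, 2, 4)],
    [(10, 10, 10), (12, 12, 12)],
]
toy_merges = average_linkage(distance_matrix(toy_curves))
```

Each curve is reduced to a single point before any comparison, so building the matrix costs one pass over each curve plus one distance per pair of structures, which is consistent with the low complexity claimed above.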
The comparison and analysis of biological sequences is one of the hot spots of bioinformatics. Biological sequences generally refer to DNA, RNA, or protein sequences. As research has developed, RNA, which carries genetic information, has become a focus of study. RNA secondary structure is more conserved than the primary sequence, and a great deal of information that can be used for classification and phylogenetic analysis has been found in it. The analysis of RNA secondary structure is therefore of great significance and value. This thesis studies the similarity of RNA secondary structures. We propose two methods to analyze this similarity, one based on a new visual representation and the other based on Lempel-Ziv complexity. Together they provide a new approach to the visualization and analysis of biological sequences.
     The main work of this thesis is as follows:
     (1) We propose a new visual representation of RNA secondary structure, the CZ curve, and introduce two of its properties. Based on the CZ curve, we show projection graphs of the points corresponding to the RNA secondary structures, from which some information about base composition and similarity can be read directly. We then apply the method to compute the similarity of RNA secondary structures. After presenting the results of the similarity analysis, we combine the similarity matrix with a hierarchical clustering algorithm to build a phylogenetic tree for 11 real RNA secondary structures. The results show that our method can not only analyze the similarity between RNA secondary structures (including pseudoknots) effectively, but also classify different kinds of RNA secondary structures accurately. Moreover, the method only needs the geometric center of the characteristic curve of each RNA secondary structure to compute the similarity matrix, so it has low computational complexity.
     (2) To address the problem that different RNA secondary structures may correspond to the same characteristic sequence, we propose a new way to describe the characteristic sequence of an RNA secondary structure and give the rules to follow in the conversion process. We then compute the similarity between the new characteristic sequences using Lempel-Ziv complexity. We choose two data sets from Chapter 3 as test data. The results are consistent with the analyses given in the literature, which shows that our method can effectively extract the structural information of secondary structures and avoids the problem that different RNA secondary structures may correspond to the same characteristic sequence.
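The Lempel-Ziv comparison in item (2) can be sketched as follows. The abstract does not state which encoding or which distance formula is used, so this sketch makes two assumptions: complexity is counted as the number of components in the exhaustive history of Lempel and Ziv (1976), and the distance is the normalized measure max{c(SQ) - c(S), c(QS) - c(Q)} / max{c(S), c(Q)} from Otu and Sayood (2003). The function names and the input strings are hypothetical stand-ins for characteristic sequences.

```python
def lz_complexity(s):
    """Number of components in the Lempel-Ziv (1976) exhaustive history of s."""
    i, count, n = 0, 0, len(s)
    while i < n:
        length = 1
        # Grow the component while it can still be copied from the text
        # that precedes its last symbol (overlap with itself is allowed).
        while i + length <= n and s[i:i + length] in s[:i + length - 1]:
            length += 1
        count += 1
        i += length
    return count

def lz_distance(s, q):
    """Normalized LZ-based distance (assumed formula, after Otu and Sayood)."""
    cs, cq = lz_complexity(s), lz_complexity(q)
    if max(cs, cq) == 0:
        return 0.0  # both sequences are empty
    gain_sq = lz_complexity(s + q) - cs  # new components q adds after s
    gain_qs = lz_complexity(q + s) - cq  # new components s adds after q
    return max(gain_sq, gain_qs) / max(cs, cq)

# "AACGTACCATTG" parses as A.AC.G.T.ACC.AT.TG, i.e. 7 components.
example_complexity = lz_complexity("AACGTACCATTG")

# Hypothetical sequences; a real run would use the encoded structures.
example_distance = lz_distance("AACGTACCATTG", "AACGGTACCATT")
```

A sequence that shares long substrings with another adds few new components when appended to it, so the distance is small for similar sequences. The substring search used here is simple rather than fast, which is acceptable for sequences of a few hundred symbols.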
