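Study on Models of Protein Structure Prediction (蛋白质结构预测模型研究)

English title: Study on Models of Protein Structure Prediction
Author: Luo Liang (罗亮)
Thesis level: Doctoral
Discipline: Systems Analysis and Integration
Chinese keywords: 蛋白质结构预测 ; 最短路径 ; DNA计算 ; 最大权团 ; 条件随机场 ; 二硫键
English keywords: protein structure prediction ; shortest path ; DNA computing ; maximum weight clique ; conditional random fields ; disulfide bonding
Degree year: 2010
Advisor: Xu Jin (许进)
Discipline code: 071102
Degree-granting institution: Huazhong University of Science and Technology (华中科技大学)
Thesis submission date: 2010-05-01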

Abstract

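Over the past 20 years, the exponential growth of biological data has given rise to a new interdisciplinary field, bioinformatics. Predicting protein structure and function is one of its core research topics: it not only helps explain how protein folding arises but also provides important guidance for experimental biology.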
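     The key to protein structure prediction is to build an effective prediction model and to give a reasonable, fast prediction algorithm. However, the spatial structure of proteins is complex and the causes of the various structures are not fully understood, so current prediction models and algorithms each have their own limitations, and the accuracy of a model and the computational complexity of solving it constrain each other. To address these problems, this dissertation proposes and improves several models and methods for protein structure prediction.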
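     Graph theory plays an important role in research on protein structure prediction. This dissertation converts protein secondary structure prediction into a shortest-path problem on a graph: every three vertices represent the secondary structures that one amino acid residue in the sequence may form, edges represent possible connections between residues, and a function is designed to assign weights to the edges, so that the shortest path in the weighted graph corresponds to the secondary structure of the protein. The method was applied to several test sets and gave good prediction results, and the choice of the environment parameters in the model is discussed.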
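The shortest-path formulation can be sketched as a layered graph with three vertices (H, E, C) per residue. The weight function used here, a per-residue cost plus a penalty for switching state, is a hypothetical stand-in for illustration only; the dissertation's actual edge-weight function and environment parameters are not given in this abstract.

```python
from typing import Dict, List, Tuple

STATES = ("H", "E", "C")  # helix, strand, coil: three vertices per residue


def shortest_path_structure(
    node_cost: List[Dict[str, float]],
    switch_cost: float = 1.0,
) -> Tuple[str, float]:
    """Return the minimum-weight path through the layered graph.

    node_cost[i][s] is an assumed cost of giving residue i the state s.
    The edge from state p (residue i-1) to state s (residue i) weighs
    node_cost[i][s], plus switch_cost when p != s. Both are illustrative.
    """
    if not node_cost:
        return "", 0.0
    # best[s] = (cost of the cheapest path ending in state s, that path)
    best = {s: (node_cost[0][s], s) for s in STATES}
    for costs in node_cost[1:]:
        new_best = {}
        for s in STATES:
            candidates = []
            for p in STATES:
                prev_cost, prev_path = best[p]
                edge = costs[s] + (switch_cost if p != s else 0.0)
                candidates.append((prev_cost + edge, prev_path + s))
            new_best[s] = min(candidates)
        best = new_best
    total, path = min(best.values())
    return path, total
```

Because every edge goes from one residue to the next, the graph is acyclic and a single left-to-right pass finds the shortest path; a general algorithm such as Dijkstra's would give the same result.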
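     Redundancy in protein sequence data is a problem to be avoided when training protein structure prediction models. This dissertation introduces the graph-theoretic concept of the maximum clique into redundancy-removal algorithms, uses mature maximum clique algorithms to improve the processing of redundant protein data, and applies the method to several kinds of protein data with fairly good results.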
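One plausible reading of the clique-based redundancy removal is sketched below. The assumption is that two sequences are joined by an edge when their pairwise similarity is below a threshold, so that a maximum clique is a largest set of mutually non-redundant sequences. The similarity matrix, the threshold, and the function names are hypothetical; the abstract does not state the dissertation's exact graph construction.

```python
from typing import List, Sequence, Set


def max_clique(adj: List[Set[int]]) -> Set[int]:
    """Return one maximum clique, using Bron-Kerbosch with pivoting."""
    best: Set[int] = set()

    def expand(r: Set[int], p: Set[int], x: Set[int]) -> None:
        nonlocal best
        if not p and not x:
            # r is a maximal clique; keep it if it is the largest so far
            if len(r) > len(best):
                best = set(r)
            return
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(range(len(adj))), set())
    return best


def nonredundant_subset(
    similarity: Sequence[Sequence[float]], threshold: float
) -> List[int]:
    """Indices of a largest set of sequences with pairwise similarity < threshold."""
    n = len(similarity)
    # Edge i-j means "i and j are dissimilar enough to keep together".
    adj = [
        {j for j in range(n) if j != i and similarity[i][j] < threshold}
        for i in range(n)
    ]
    return sorted(max_clique(adj))
```

The worst-case running time is exponential in the number of sequences, so this sketch only suits small data sets.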
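     DNA computing is a completely new model of computation. This dissertation attempts to bring DNA computing into protein structure prediction and builds a plasmid DNA computing model for it, offering a new line of research. The model first converts a segment of side chain or main chain whose spatial conformation is to be determined into a vertex of a weighted graph, with vertices and edges weighted according to predefined criteria; it then combines this with the plasmid DNA computing model for the maximum weight clique problem to obtain a DNA computing model for protein prediction. Finally, the encoding of the plasmid DNA computing model is studied and an encoding tool is given.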
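The maximum weight clique problem that the DNA model targets can be illustrated in software by a generate-and-filter search: form every candidate vertex subset, discard those containing a non-adjacent pair, and keep the heaviest survivor. This is only a conventional analogue of the combinatorial problem, not the plasmid procedure itself. It assumes vertex weights only, and the function name and inputs are hypothetical.

```python
from itertools import combinations
from typing import Iterable, Sequence, Tuple


def max_weight_clique(
    weights: Sequence[float], edges: Iterable[Tuple[int, int]]
) -> Tuple[Tuple[int, ...], float]:
    """Exhaustively find the clique with the largest total vertex weight."""
    n = len(weights)
    edge_set = {frozenset(e) for e in edges}
    best: Tuple[int, ...] = ()
    best_weight = 0.0
    for mask in range(1, 1 << n):  # every non-empty vertex subset
        subset = [v for v in range(n) if (mask >> v) & 1]
        # Filter step: drop subsets that contain a non-adjacent pair.
        if all(frozenset(pair) in edge_set for pair in combinations(subset, 2)):
            weight = sum(weights[v] for v in subset)
            if weight > best_weight:
                best, best_weight = tuple(subset), weight
    return best, best_weight
```

The loop visits 2^n subsets, which is the cost that a molecular implementation tries to absorb through massive parallelism.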
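     Probabilistic graphical models are an effective class of models for protein structure prediction. This dissertation classifies the 20 amino acids and, by collecting statistics on the typical formation patterns of β-sheets, extends the 3-state hidden Markov prediction model to 9 states, which effectively improves the prediction accuracy for β-sheets. Conditional random fields are a recently proposed probabilistic graphical model; this dissertation constructs a protein structure prediction model based on conditional random fields and gives training and decoding algorithms for it. In addition, the multiple sequence alignment program PSI-BLAST is used to convert protein sequences into sequence motifs that carry evolutionary information in order to improve prediction accuracy. Finally, the prediction results are given and compared.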
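Decoding in both hidden Markov models and linear-chain conditional random fields is commonly done with the Viterbi algorithm. The sketch below is a generic log-space Viterbi decoder for a small HMM, not the dissertation's 9-state model or its CRF decoder. The state set, observation alphabet, and all probabilities passed to it are assumptions supplied by the caller.

```python
import math
from typing import Dict, Sequence


def viterbi(
    obs: Sequence[str],
    states: Sequence[str],
    start_p: Dict[str, float],
    trans_p: Dict[str, Dict[str, float]],
    emit_p: Dict[str, Dict[str, float]],
) -> str:
    """Return the most probable state sequence for obs (log-space Viterbi)."""
    if not obs:
        return ""
    # score[s] = log-probability of the best path that ends in state s
    score = {s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}
    backpointers = []
    for symbol in obs[1:]:
        prev = score
        score = {}
        pointer = {}
        for s in states:
            best_prev = max(states, key=lambda p: prev[p] + math.log(trans_p[p][s]))
            score[s] = (
                prev[best_prev]
                + math.log(trans_p[best_prev][s])
                + math.log(emit_p[s][symbol])
            )
            pointer[s] = best_prev
        backpointers.append(pointer)
    # Trace the best path backwards from the best final state.
    last = max(states, key=lambda s: score[s])
    path = [last]
    for pointer in reversed(backpointers):
        path.append(pointer[path[-1]])
    return "".join(reversed(path))
```

All probabilities must be strictly positive, because the sketch takes logarithms without smoothing.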
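     An important problem in protein structure prediction is correctly predicting disulfide bond connectivity. Accurate disulfide bond prediction reduces the search space of protein conformations and helps predict the 3D structure of a protein. This dissertation applies the LVQ (learning vector quantization) neural network method to disulfide bond prediction. The results show that disulfide bond connectivity is closely related to the local sequence pattern around cysteine, so the disulfide connectivity of a protein can be predicted from its primary sequence; applying the method to disulfide bond prediction gave good results.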
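The core of LVQ is a simple prototype update, sketched here as the basic LVQ1 rule: the prototype nearest to a training sample moves toward it when their labels agree and away from it otherwise. The feature vectors, labels, learning rate, and function names are hypothetical; the dissertation's encoding of the local sequence window around each cysteine is not described in this abstract.

```python
from typing import List, Sequence


def nearest(protos: Sequence[Sequence[float]], x: Sequence[float]) -> int:
    """Index of the prototype closest to x (squared Euclidean distance)."""
    return min(
        range(len(protos)),
        key=lambda k: sum((w - xi) ** 2 for w, xi in zip(protos[k], x)),
    )


def lvq1_train(
    samples: Sequence[Sequence[float]],
    labels: Sequence[str],
    prototypes: Sequence[Sequence[float]],
    proto_labels: Sequence[str],
    lr: float = 0.1,
    epochs: int = 20,
) -> List[List[float]]:
    """Train prototypes with LVQ1 and return the updated copies."""
    protos = [list(p) for p in prototypes]
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            k = nearest(protos, x)
            # Attract the winner on a label match, repel it on a mismatch.
            sign = 1.0 if proto_labels[k] == y else -1.0
            protos[k] = [w + sign * lr * (xi - w) for w, xi in zip(protos[k], x)]
    return protos


def lvq_predict(
    protos: Sequence[Sequence[float]], proto_labels: Sequence[str], x: Sequence[float]
) -> str:
    """Label of the prototype nearest to x."""
    return proto_labels[nearest(protos, x)]
```

A fixed learning rate keeps the sketch short; practical LVQ implementations usually decay it over the epochs.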
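     The HP model is a simplified model for protein structure prediction. This dissertation improves it by dividing amino acid residues into four classes according to their hydrophobic/hydrophilic and physicochemical properties, which reduces a protein sequence to a sequence over a four-letter alphabet, and gives a simplified model that predicts the spatial structure of a protein from the lowest-energy structure of that four-letter sequence. An improved simulated annealing algorithm is then used to predict two-dimensional structures for four proteins of different lengths; it finds conformations with lower energy than earlier HP-model results, indicating that the simplified model is more accurate than the HP model. The method can also be applied to three-dimensional protein structure prediction.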
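The quantity that a lattice search such as simulated annealing minimizes can be sketched as follows. The default contact table is the classical HP energy of -1 per non-consecutive H-H contact on a 2D square lattice. A four-class model would pass a larger table through the `contact` argument, but the dissertation's four classes and their energies are not given in this abstract, so any such table is an assumption.

```python
from typing import Dict, Optional, Sequence, Tuple

Coord = Tuple[int, int]


def lattice_energy(
    sequence: str,
    coords: Sequence[Coord],
    contact: Optional[Dict[Tuple[str, str], float]] = None,
) -> float:
    """Energy of a self-avoiding chain on the 2D square lattice."""
    if contact is None:
        contact = {("H", "H"): -1.0}  # classical HP model
    if len(sequence) != len(coords):
        raise ValueError("sequence and coords must have the same length")
    if len(set(coords)) != len(coords):
        raise ValueError("conformation is not self-avoiding")
    for (x1, y1), (x2, y2) in zip(coords, coords[1:]):
        if abs(x1 - x2) + abs(y1 - y2) != 1:
            raise ValueError("consecutive residues must be lattice neighbours")
    index_at = {c: i for i, c in enumerate(coords)}
    energy = 0.0
    for i, (x, y) in enumerate(coords):
        # Look only right and up so each lattice pair is counted once.
        for neighbour in ((x + 1, y), (x, y + 1)):
            j = index_at.get(neighbour)
            if j is not None and abs(i - j) > 1:
                pair = (sequence[i], sequence[j])
                energy += contact.get(pair, contact.get(pair[::-1], 0.0))
    return energy
```

A simulated annealing search would propose a local move of the chain, recompute this energy, and accept the move with the Metropolis probability at the current temperature.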
The exponential growth of biological data has given rise to a new multidisciplinary research area, bioinformatics. One of its major research issues is protein structure prediction based on protein sequence. This interdisciplinary field draws on mathematics, computer science, information science, physics, systems science, and management science as well as biology. For the problem of protein structure prediction, this dissertation presents several new and improved models.
     Graph theory plays a key role in protein structure prediction. In this dissertation, a method based on the shortest path of a graph is proposed. Every three vertices of the graph represent the possible secondary structures of one residue, and each edge of the graph is assigned a weight by a function. The shortest path in this weighted graph corresponds to the predicted secondary structure. Several groups of proteins were tested with this method, and the results showed that it is feasible. Finally, the selection of parameters is discussed.
     DNA computing is a new model of computation. This dissertation introduces DNA computing into protein structure prediction. Each possible conformation of a residue in an amino acid sequence is represented as a node in a graph. Each node is given a weight based on the degree of interaction between its side-chain atoms and the local main-chain atoms. The protein structure prediction problem is mapped to finding the maximal sets of completely connected nodes (cliques) in a graph, and a DNA computing model is then used to find the maximal cliques.
     Probabilistic graphical models are effective for protein structure prediction. By introducing a hidden state variable, a hidden conditional random field (HCRF) is built and applied to protein structure prediction. A method for constructing the model is given, together with algorithms to train and decode it, and the model is used to predict the secondary structure of a well-known protein dataset (CB513). Finally, the results are compared with those of other methods.
     An important problem in protein structure prediction is correctly locating disulfide bonds in proteins. Knowing the location of disulfide bonds can strongly reduce the search in the conformational space of protein structures. Correct prediction of disulfide bonding from the protein residue sequence may therefore also help in predicting the 3D structure. In this dissertation, the LVQ artificial neural network method is applied to predict disulfide bonding in proteins. The local sequence arrangement around cysteine is of great significance to disulfide bonding, so disulfide bonding can be predicted from the primary structure. The method was used to predict disulfide bonding in protein structures and gave good results.
     The HP model is a simplified model for protein structure prediction. The 20 kinds of amino acid residues are classified into four groups, and a protein sequence is converted into a new sequence over a four-letter alphabet. A protein structure prediction model is then constructed by searching for the lowest-energy structure of the new sequence. A simulated annealing algorithm is used with this model, and the results reach lower energy than with the HP model. The model can be extended to predict protein structure in 3D.

