Research on HMM-Based Protein Side-Chain Modeling and Its Applications
Abstract
Predicting a protein's three-dimensional structure from its amino acid sequence is a long-standing challenge in bioinformatics. Side-chain prediction is an important subproblem of both protein structure prediction and protein design.
     This thesis proposes a technique for modeling protein side chains based on the hidden Markov model (HMM) and constructs two side-chain models: a sequence-dependent model and a sequence-backbone-dependent model. The sequence-dependent model takes the amino acid sequence as its main observed data; the sequence-backbone-dependent model additionally takes the backbone dihedral angles as observed data. Both models learn the relationship between the observed data and the side-chain conformations. After training on a carefully selected training set, each model can be sampled to generate a rotamer library specific to a given amino acid sequence. Compared with rotamer libraries produced by other popular side-chain modeling methods, the sampled libraries are closer to the native conformations.
     We apply the rotamer libraries generated by the two models to a state-of-the-art side-chain prediction system and compare the results from several perspectives with two established methods in the field. Relative to predictions based on the traditional backbone-dependent rotamer library, prediction accuracy improves to some extent on all test targets.
     Taken together, the experiments indicate that the sequence-dependent and sequence-backbone-dependent models capture the influence of the surrounding amino acids in the sequence on the side-chain conformation of a given residue. The sequence-dependent and sequence-backbone-dependent rotamer libraries they generate provide solid support for protein side-chain prediction.
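     The idea of sampling a trained model to obtain a sequence-specific rotamer library can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the two hidden states, the three residue types, every probability, the restriction to the chi1 angle, and all function names are hypothetical values chosen for illustration, and no training step is shown. The sketch assumes a non-empty sequence, draws a hidden-state path with forward filtering and backward sampling, and then draws one chi1 angle per residue from a von Mises distribution attached to each state.

```python
import math
import random

# Hypothetical toy parameters (NOT taken from the thesis):
# two hidden states, three residue types, chi1 only.
START = [0.5, 0.5]
TRANS = [[0.8, 0.2],
         [0.3, 0.7]]
EMIT_AA = [{"S": 0.6, "V": 0.3, "L": 0.1},
           {"S": 0.1, "V": 0.3, "L": 0.6}]
# Per-state chi1 emission: (mean in degrees, von Mises concentration).
EMIT_CHI1 = [(60.0, 8.0), (300.0, 8.0)]
N_STATES = len(START)


def forward(seq):
    """Return normalized forward (filtering) probabilities per position."""
    alphas = []
    for t, aa in enumerate(seq):
        if t == 0:
            prior = START
        else:
            prev = alphas[-1]
            prior = [sum(prev[i] * TRANS[i][j] for i in range(N_STATES))
                     for j in range(N_STATES)]
        unnorm = [prior[j] * EMIT_AA[j][aa] for j in range(N_STATES)]
        total = sum(unnorm)
        alphas.append([u / total for u in unnorm])
    return alphas


def sample_states(seq, rng):
    """Draw one hidden-state path given the observed residues
    (forward filtering, backward sampling)."""
    alphas = forward(seq)
    states = [0] * len(seq)
    states[-1] = rng.choices(range(N_STATES), weights=alphas[-1])[0]
    for t in range(len(seq) - 2, -1, -1):
        nxt = states[t + 1]
        weights = [alphas[t][i] * TRANS[i][nxt] for i in range(N_STATES)]
        states[t] = rng.choices(range(N_STATES), weights=weights)[0]
    return states


def sample_chi1(seq, rng):
    """Draw one chi1 angle (degrees, in [0, 360)) per residue."""
    chis = []
    for s in sample_states(seq, rng):
        mean_deg, kappa = EMIT_CHI1[s]
        angle = rng.vonmisesvariate(math.radians(mean_deg), kappa)
        chis.append(math.degrees(angle) % 360.0)
    return chis


def build_library(seq, n_samples=200, seed=0):
    """Collect repeated samples as a toy sequence-specific rotamer library."""
    rng = random.Random(seed)
    return [sample_chi1(seq, rng) for _ in range(n_samples)]
```

     Calling build_library("SVL", n_samples=50) returns 50 sampled chi1 triples for the three-residue string. A sequence-backbone-dependent variant would, under the same assumptions, add an emission term for the observed backbone dihedral angles inside forward().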