Sigma-RF: prediction of the variability of spatial restraints in template-based modeling by random forest
  • Authors: Juyong Lee (1) (2)
    Kiho Lee (2)
    InSuk Joung (2) (4)
    Keehyoung Joo (2) (3)
    Bernard R Brooks (1)
    Jooyoung Lee (2) (4)

    1. Laboratory of Computational Biology, National Heart, Lung, and Blood Institute, National Institutes of Health, 5635 Fishers Ln, Bethesda, 20852, USA
    2. Center for In Silico Protein Science, Korea Institute for Advanced Study, Seoul, Korea
    3. Center for Advanced Computation, Korea Institute for Advanced Study, Seoul, Korea
    4. School of Computational Sciences, Korea Institute for Advanced Study, Seoul, Korea
  • Keywords: Template-based modeling; Homology modeling; Random forest; Machine learning; Protein structure; Protein structure prediction; Protein sequence; Bioinformatics; Statistics
  • Journal: BMC Bioinformatics
  • Publication year: 2015
  • Publication date: December 2015
  • Volume: 16
  • Issue: 1
  • Full-text size: 3,103 KB
  • Journal subjects: Bioinformatics; Microarrays; Computational Biology/Bioinformatics; Computer Appl. in Life Sciences; Combinatorial Libraries; Algorithms
  • Publisher: BioMed Central
  • ISSN: 1471-2105
Abstract
Background: In template-based modeling with a single template, the inter-atomic distances of an unknown protein structure are assumed to follow Gaussian probability density functions whose peaks are centered at the distances between the corresponding atoms in the template structure. The width of each Gaussian, the variability of the spatial restraint, is closely related to the reliability of the restraint information extracted from the template, and it must be estimated accurately for successful template-based protein structure modeling.

Results: To predict the variability of spatial restraints in template-based modeling, we devised a prediction model, Sigma-RF, using the random forest (RF) algorithm. Benchmark results on 22 CASP9 targets show that the variability values from Sigma-RF correlate more strongly with the true distance deviations than those from Modeller. We assessed the effect of the new sigma values by performing single-domain homology modeling of 22 CASP9 targets and 24 CASP10 targets. For most of the targets tested, the Sigma-RF values yielded more accurate 3D models than the Modeller values when both were applied to identical alignments.

Conclusions: By investigating the importance of the input features used in RF training, we find that the most important feature is quasi-local information: the average alignment quality of the residues located between and at the two aligned residues. This average alignment quality is more important than the previously identified local quantity, the product of the alignment qualities at the two aligned residues.
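
As a minimal illustration of the quantities named in the abstract, the Python sketch below evaluates a Gaussian distance restraint and the two alignment-derived features. It is not the authors' implementation: the function names, the per-position `quality` scores, and the reading of "between and at" as an inclusive window are assumptions made for illustration only.

```python
import math


def restraint_density(d, d_template, sigma):
    """Gaussian density of a model distance d, centered on the template distance."""
    z = (d - d_template) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def restraint_penalty(d, d_template, sigma):
    """Negative log of the Gaussian density, written analytically to avoid underflow."""
    z = (d - d_template) / sigma
    return 0.5 * z * z + math.log(sigma * math.sqrt(2.0 * math.pi))


def local_product(quality, i, j):
    """Local feature: product of the alignment qualities at the two aligned positions."""
    return quality[i] * quality[j]


def quasi_local_average(quality, i, j):
    """Quasi-local feature: mean alignment quality over positions i..j inclusive.

    Assumption: 'between and at two aligned residues' is read as an inclusive window.
    """
    lo, hi = sorted((i, j))
    window = quality[lo:hi + 1]
    return sum(window) / len(window)


if __name__ == "__main__":
    # Hypothetical per-position alignment quality scores in [0, 1].
    quality = [1.0, 0.2, 0.2, 1.0]
    print(local_product(quality, 0, 3))
    print(quasi_local_average(quality, 0, 3))
    # The same 1 Å deviation is penalized more when sigma is smaller.
    for sigma in (0.5, 1.0, 2.0):
        extra = restraint_penalty(11.0, 10.0, sigma) - restraint_penalty(10.0, 10.0, sigma)
        print(sigma, extra)
```

In this toy input the two end positions are well aligned while the positions between them are not, so the local product stays high and the quasi-local average drops. This is one way to see how the two features can carry different information about the same residue pair.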
