Predictability of gene ontology slim-terms from primary structure information in Embryophyta plant proteins
  • Authors: Jorge Alberto Jaramillo-Garzón (1) (3)
    Joan Josep Gallardo-Chacón (2) (5)
    César Germán Castellanos-Domínguez (1)
    Alexandre Perera-Lluna (2) (4)
  • Journal: BMC Bioinformatics
  • Publication year: 2013
  • Publication date: December 2013
  • Volume: 14
  • Issue: 1
  • Full-text size: 436KB
  • Author affiliations:
    1. Departamento de Ingeniería Eléctrica, Electrónica y Computación, Universidad Nacional de Colombia sede Manizales, Campus La Nubia, km 7 vía al Magdalena, Manizales (Caldas), Colombia
    2. Centre de Recerca en Enginyeria Biomèdica, ESAII, Universitat Politècnica de Catalunya, Pau Gargallo 5, 08028, Barcelona, España
    3. Centro de Investigación, Instituto Tecnológico Metropolitano, Calle 73 No 76A - 354, Medellín (Antioquia), Colombia
    4. Centro de Investigación Biomédica en Red en Bioingeniería, Biomateriales y Nanomedicina (CIBER-BBN), Kragujevac, España
    5. Planta de Tecnologia dels Aliments, Universidad Autónoma de Barcelona, 08193 Cerdanyola del Vallès, Catalonia, España
  • ISSN:1471-2105
Abstract
Background: Proteins are the key elements on the path from genetic information to the development of life. The roles played by different proteins are difficult to uncover experimentally, because doing so involves complex procedures such as genetic modification, injection of fluorescent proteins, and gene knock-out methods. The knowledge gained about each protein is usually annotated in databases using schemes such as the one proposed by the Gene Ontology (GO) Consortium. Several methods have been proposed to predict GO terms from primary structure information, but very few are available for large-scale functional annotation of plants, and their reported success rates are much lower than those reported for non-plant predictors. This paper explores the predictability of GO annotations for proteins belonging to the Embryophyta group from a set of features extracted solely from their primary amino acid sequence.

Results: High predictability of several GO terms was found for the Molecular Function and Cellular Component ontologies. As expected, a lower degree of predictability was found for Biological Process annotations, although a few biological processes were easily predicted. Proteins related to transport and transcription were particularly well predicted from primary structure information. The most discriminant features for prediction were those related to the electric charges of the amino acid sequence and hydropathicity-derived features.

Conclusions: An analysis of the predictability of GO-slim terms in plants was carried out in order to determine the single categories or groups of functions that are most closely related to primary structure information. For each highly predictable GO term, the features responsible for that success were identified and discussed. Unlike most published studies, which focus on a few categories or a single ontology, the results in this paper provide a complete landscape of GO predictability from primary structure, encompassing 75 GO terms at the molecular, cellular and phenotypical levels. They thus provide a valuable guide for researchers interested in further advances in protein function prediction for Embryophyta plants.
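
To make the notion of charge-related and hydropathicity-derived features concrete, the following is a minimal sketch of how such descriptors might be computed from a primary sequence alone. It is not the authors' feature set, which this record does not describe: the function name `sequence_features`, the choice of the Kyte-Doolittle scale, and the simplified charge rule (K and R counted as +1, D and E as -1, histidine ignored) are all assumptions made for illustration.

```python
# Illustrative sketch only: the feature definitions below are assumptions,
# not the features used in the paper.

# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Simplified charge assignment (assumption): histidine is treated as neutral.
POSITIVE = set("KR")
NEGATIVE = set("DE")


def sequence_features(seq):
    """Return simple charge and hydropathy descriptors for a protein sequence.

    Characters that are not one of the 20 standard residues are skipped.
    """
    residues = [aa for aa in seq.upper() if aa in KD]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")

    n = len(residues)
    pos = sum(1 for aa in residues if aa in POSITIVE)
    neg = sum(1 for aa in residues if aa in NEGATIVE)

    return {
        "length": n,
        # Mean hydropathy over the sequence (often called GRAVY).
        "gravy": sum(KD[aa] for aa in residues) / n,
        # Net charge under the simplified rule above.
        "net_charge": pos - neg,
        "frac_positive": pos / n,
        "frac_negative": neg / n,
    }


if __name__ == "__main__":
    # Hypothetical example sequence, not taken from the paper.
    print(sequence_features("MKTLLVAEDR"))
```

Descriptors of this kind yield a fixed-length numeric vector per protein, which could then be passed to any standard classifier trained separately for each GO-slim term.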
