A document classifier for medicinal chemistry publications trained on the ChEMBL corpus
  • Authors: George Papadatos ; Gerard JP van Westen ; Samuel Croset…
  • Keywords: Machine learning ; Triage ; Curation ; Document classification
  • Journal: Journal of Cheminformatics
  • Publication year: 2014
  • Publication date: December 2014
  • Year: 2014
  • Volume: 6
  • Issue: 1
  • Full-text size: 2,366 KB
  • Journal category: Physics and Astronomy
  • Journal subjects: Computer Applications in Chemistry
    Theoretical and Computational Chemistry
    Computational Biology/Bioinformatics
    Documentation and Information in Chemistry
  • Publisher: Chemistry Central Ltd
  • ISSN:1758-2946
Abstract
Background: The large increase in the number of scientific publications has fuelled a need for semi- and fully automated text mining approaches to assist in the triage process, both for individual scientists and for larger-scale data extraction and curation into public databases. Here, we introduce a document classifier that is able to distinguish between publications that are 'ChEMBL-like' (i.e. related to small molecule drug discovery and likely to contain quantitative bioactivity data) and those that are not. The unprecedented size of the medicinal chemistry literature collection, coupled with the advantage of manual curation and mapping to chemistry and biology, makes the ChEMBL corpus a unique resource for text mining.

Results: The method has been implemented as a data protocol/workflow for both Pipeline Pilot (version 8.5) and KNIME (version 2.9). Both workflows and models are freely available at: ftp://ftp.ebi.ac.uk/pub/databases/chembl/text-mining. These can be readily modified to include additional keyword constraints to further focus searches.

Conclusions: Large-scale machine learning document classification was shown to be very robust and flexible for this particular application, as illustrated in four distinct text-mining-based use cases. The models are readily available on two data workflow platforms, which we believe will allow the majority of the scientific community to apply them to their own data.

Graphical Abstract: Multidimensional scaling analysis applied to document vectors derived from titles and abstracts in different corpora. Notably, there is large overlap between the documents in the different ChEMBL versions and BindingDB, while the background MEDLINE set is largely divergent.
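
To illustrate the kind of triage the abstract describes, the following is a minimal sketch of a bag-of-words document classifier in plain Python. It is an assumption-laden toy, not the published Pipeline Pilot or KNIME workflow: the abstract does not state which learning algorithm is used, so multinomial naive Bayes is swapped in here purely for illustration, and the class name `NaiveBayesTriage`, the labels `chembl_like` and `other`, and the four training titles are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayesTriage:
    """Multinomial naive Bayes over word counts with add-one smoothing.

    Illustrative only; not the model distributed with the paper.
    """

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = set()
        for counts in self.word_counts.values():
            self.vocab.update(counts)
        self.totals = {c: sum(self.word_counts[c].values()) for c in self.classes}
        return self

    def log_scores(self, text):
        """Return the unnormalised log posterior for each class."""
        n_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        # Words never seen in training are ignored.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        scores = {}
        for c in self.classes:
            score = math.log(self.doc_counts[c] / n_docs)
            for tok in tokens:
                likelihood = (self.word_counts[c][tok] + 1) / (self.totals[c] + vocab_size)
                score += math.log(likelihood)
            scores[c] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Hypothetical toy training set: invented titles, not drawn from ChEMBL or MEDLINE.
train_texts = [
    "Synthesis and SAR of kinase inhibitors with IC50 values",
    "Potent antagonists of the receptor: binding affinity and Ki",
    "Survey of hospital nursing staff workloads",
    "Epidemiology of seasonal influenza in schools",
]
train_labels = ["chembl_like", "chembl_like", "other", "other"]

model = NaiveBayesTriage().fit(train_texts, train_labels)
print(model.predict("Novel inhibitors with nanomolar IC50 against the kinase"))
print(model.predict("Influenza epidemiology in hospital staff"))
```

In a real setting the training texts would be titles and abstracts from a curated corpus set against a background set, and the log scores could be used to rank documents for curation rather than to assign a hard label.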
