Evaluating word embeddings and a revised corpus for part-of-speech tagging in Portuguese
  • Authors: Erick R Fonseca ; João Luís G Rosa…
  • Keywords: Natural language processing ; Part-of-speech tagging ; Neural networks ; Word embeddings
  • Journal: Journal of the Brazilian Computer Society
  • Publication year: 2015
  • Publication date: December 2015
  • Volume: 21
  • Issue: 1
  • Full-text size: 850 KB
  • Journal category: Computer Science
  • Journal subjects: Computer Science, general
    Simulation and Modeling
    Data Structures
    Operating Systems
    Computer System Implementation
  • Publisher: Springer London
  • ISSN: 1678-4804
Abstract
Background: Part-of-speech tagging is an important preprocessing step in many natural language processing applications. Despite much work already carried out in this field, there is still room for improvement, especially in Portuguese. We experiment here with an architecture based on neural networks and word embeddings that has achieved promising results in English.
Methods: We tested our classifier on different corpora: a new revision of the Mac-Morpho corpus, in which we merged some tags and performed corrections, and two previous versions of it. We evaluated the impact of using different types of word embeddings and explicit features as input.
Results: We compare our tagger's performance with other systems and achieve state-of-the-art results on the new corpus. We show how different methods for generating word embeddings and additional features differ in accuracy.
Conclusions: The work reported here contributes a new revision of the Mac-Morpho corpus and a new state-of-the-art tagger available for use out of the box.
