Evaluating word embeddings and a revised corpus for part-of-speech tagging in Portuguese


Authors: Erick R Fonseca; João Luís G Rosa…
Keywords: Natural language processing; Part-of-speech tagging; Neural networks; Word embeddings
Journal: Journal of the Brazilian Computer Society
Publication year: 2015
Publication date: December 2015
Volume: 21
Issue: 1
Journal category: Computer Science
Journal subjects: Computer Science, general
Simulation and Modeling
Data Structures
Operating Systems
Computer System Implementation
Publisher: Springer London
ISSN: 1678-4804

Abstract

Background: Part-of-speech tagging is an important preprocessing step in many natural language processing applications. Despite much work already carried out in this field, there is still room for improvement, especially in Portuguese. Here we experiment with an architecture based on neural networks and word embeddings that has achieved promising results in English.

Methods: We tested our classifier on different corpora: a new revision of the Mac-Morpho corpus, in which we merged some tags and made corrections, and two previous versions of it. We evaluate the impact of using different types of word embeddings and explicit features as input.

Results: We compare our tagger’s performance with other systems and achieve state-of-the-art results on the new corpus. We show how different methods for generating word embeddings and additional features differ in accuracy.

Conclusions: The work reported here contributes a new revision of the Mac-Morpho corpus and a new state-of-the-art tagger that is available for use out of the box.
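The abstract does not spell out the network itself, so the following is only a rough sketch of what a generic window-based neural tagger over word embeddings might look like, not the authors' implementation. The tag set, vocabulary, layer sizes, and all function names are hypothetical, and the weights are random and untrained, so the predicted tags carry no meaning; the sketch only illustrates the data flow from words to embeddings to tag scores.

```python
import math
import random

# Hypothetical toy setup: none of these tags, sizes, or weights come from the paper.
TAGS = ["N", "V", "ART"]
EMBED_DIM, WINDOW, HIDDEN = 4, 3, 5
PAD = "<pad>"

rng = random.Random(0)

def rand_matrix(rows, cols):
    """Small random weight matrix (untrained)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

# One embedding vector per vocabulary word; real embeddings would be pre-trained.
vocab = [PAD, "o", "gato", "dorme"]
embeddings = {w: [rng.uniform(-0.1, 0.1) for _ in range(EMBED_DIM)] for w in vocab}
W1 = rand_matrix(HIDDEN, WINDOW * EMBED_DIM)   # window features -> hidden layer
W2 = rand_matrix(len(TAGS), HIDDEN)            # hidden layer -> one score per tag

def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def window_features(tokens, i):
    """Concatenate the embeddings of the WINDOW tokens centered on position i."""
    half = WINDOW // 2
    padded = [PAD] * half + tokens + [PAD] * half
    feats = []
    for word in padded[i:i + WINDOW]:
        # Simplification: unknown words fall back to the padding vector.
        feats.extend(embeddings.get(word, embeddings[PAD]))
    return feats

def tag_scores(tokens, i):
    hidden = [math.tanh(x) for x in matvec(W1, window_features(tokens, i))]
    return matvec(W2, hidden)

def tag_sentence(tokens):
    """Pick the highest-scoring tag independently for each token."""
    result = []
    for i in range(len(tokens)):
        scores = tag_scores(tokens, i)
        result.append(TAGS[scores.index(max(scores))])
    return result
```

Calling `tag_sentence(["o", "gato", "dorme"])` returns one tag per token. In a real tagger the embeddings and weight matrices would be learned from an annotated corpus such as Mac-Morpho, and explicit features (for example capitalization or suffixes) could be appended to the window vector.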
