Vietnamese POS Tagging for Social Media Text
  • Keywords: Part-of-speech tagging; Social media text; Conditional random fields
  • Journal title: Lecture Notes in Computer Science
  • Publication year: 2016
  • Volume: 9949
  • Issue: 1
  • Pages: 233-242
  • Full-text size: 553 KB
  • Author affiliations: Ngo Xuan Bach (19) (20)
    Nguyen Dieu Linh (19)
    Tu Minh Phuong (19) (20)

    19. Department of Computer Science, Posts and Telecommunications Institute of Technology, Hanoi, Vietnam
    20. FPT Software Research Lab, Hanoi, Vietnam
  • Book series: Neural Information Processing
  • ISBN: 978-3-319-46675-0
  • Publication category: Computer Science
  • Publication subjects: Artificial Intelligence and Robotics
    Computer Communication Networks
    Software Engineering
    Data Encryption
    Database Management
    Computation by Abstract Devices
    Algorithm Analysis and Problem Complexity
  • Publisher: Springer Berlin / Heidelberg
  • ISSN: 1611-3349
Abstract
This paper presents an empirical study on Vietnamese part-of-speech (POS) tagging for social media text, which poses several challenges compared with tagging general text. Social media text does not always conform to formal grammar and correct spelling. It also makes frequent use of abbreviations, foreign words, and icons. A POS tagger developed for conventional, edited text would perform poorly on such noisy data. We address this problem by proposing a tagging model for Vietnamese social media text based on conditional random fields with various kinds of features. We introduce a corpus for POS tagging, which consists of more than four thousand sentences from Facebook, the most popular social network in Vietnam. Using this corpus, we performed a series of experiments to evaluate the proposed model. Our model achieved 88.26% tagging accuracy, an 11.27% improvement over a state-of-the-art Vietnamese POS tagger developed for general, conventional text.
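
The abstract does not list the features used, so the following is only an illustrative sketch of the kind of per-token features a conditional random field tagger might receive for noisy social media text. The feature names, the emoticon and URL patterns, and the helper functions `token_features` and `sentence_features` are assumptions made for this example and are not taken from the paper.

```python
import re

# Hypothetical patterns for this sketch only (not from the paper).
# EMOTICON matches simple ASCII emoticons such as :) ;-( :D =P
EMOTICON = re.compile(r"^[:;=][\-^']?[()\[\]DPpOo/\\|3]+$")
URL = re.compile(r"^(https?://|www\.)\S+$", re.IGNORECASE)


def token_features(tokens, i):
    """Return a feature dict for tokens[i], one dict per token as many CRF toolkits expect."""
    tok = tokens[i]
    feats = {
        "lower": tok.lower(),
        "is_emoticon": bool(EMOTICON.match(tok)),
        "is_url": bool(URL.match(tok)),
        "is_hashtag": tok.startswith("#") and len(tok) > 1,
        "has_digit": any(ch.isdigit() for ch in tok),
        "is_upper": tok.isupper(),
        # Many Vietnamese words carry diacritics, so a purely ASCII alphabetic
        # token may hint at a foreign word or an unaccented abbreviation.
        "ascii_alpha": tok.isascii() and tok.isalpha(),
        "suffix2": tok[-2:].lower(),
    }
    # Neighbouring words as context, with sentence-boundary markers.
    feats["prev"] = tokens[i - 1].lower() if i > 0 else "<BOS>"
    feats["next"] = tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>"
    return feats


def sentence_features(tokens):
    """Build the feature sequence for one tokenized sentence."""
    return [token_features(tokens, i) for i in range(len(tokens))]


if __name__ == "__main__":
    for feats in sentence_features(["Mình", "đi", "shopping", ":)"]):
        print(feats)
```

A feature sequence of this shape could then be paired with gold POS tags to train a CRF. Which features actually help on Vietnamese Facebook text is an empirical question that the paper's experiments address.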
