文摘
This paper presents an empirical study on Vietnamese part-of-speech (POS) tagging for social media text, which shows several challenges compared with tagging for general text. Social media text does not always conform to formal grammars and correct spelling. It also uses abbreviations, foreign words, and icons frequently. A POS tagger developed for conventional, edited text would perform poorly on such noisy data. We address this problem by proposing a tagging model based on Conditional random fields with various kinds of features for Vietnamese social media text. We introduce a corpus for POS tagging, which consists of more than four thousands sentences from Facebook, the most popular social network in Vietnam. Using this corpus, we performed a series of experiments to evaluate the proposed model. Our model achieved 88.26 % tagging accuracy, which is 11.27 % improvement over a state-of-the-art Vietnamese POS tagger developed for general, conventional text.