Abstract
Traditional word vector representation models often ignore the syntactic and morphological structure within words, which leads to low prediction accuracy. This paper proposes an improved word vector representation algorithm based on the fastText model. First, stop-word removal is applied to the training datasets to eliminate the interference of meaningless prepositions and other function words with the prediction model, reducing noise in the data. Second, the n-gram decomposition in the fastText model is constrained so that decompositions must conform to the compositional structure of English words. Finally, the word prefix and suffix boundary markers used in the fastText model are removed, reducing the interference of useless decompositions with model prediction. Experimental results show that, compared with the original fastText model, the proposed improved model achieves better accuracy on word relationship scoring, semantic similarity, and syntactic similarity tasks.
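The three modifications described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the stop-word list, function names, and n-gram range (3–6, fastText's defaults) are assumptions. The `add_markers` flag reproduces the original fastText behaviour of wrapping each word in `<` and `>`, while the default path omits the markers and keeps only purely alphabetic n-grams, mirroring the proposed constraints.

```python
# Sketch of the three preprocessing changes: stop-word removal,
# restricting n-grams to valid English letter sequences, and
# dropping the '<'/'>' word boundary markers. Illustrative only.

STOPWORDS = {"the", "of", "to", "in", "and", "a"}  # assumed small stop-word list


def remove_stopwords(tokens):
    """Step 1: drop stop-words to reduce noise in the training data."""
    return [t for t in tokens if t.lower() not in STOPWORDS]


def char_ngrams(word, n_min=3, n_max=6, add_markers=False):
    """Steps 2 and 3: fastText-style character n-grams.

    add_markers=True reproduces the original fastText behaviour
    (word wrapped in '<' and '>'); the improved variant omits the
    markers and keeps only purely alphabetic n-grams.
    """
    w = f"<{word}>" if add_markers else word
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(w) - n + 1):
            g = w[i:i + n]
            if add_markers or g.isalpha():  # keep only valid letter sequences
                grams.append(g)
    return grams


print(remove_stopwords("the structure of words".split()))  # ['structure', 'words']
print(char_ngrams("where"))                    # no boundary markers in any n-gram
print(char_ngrams("where", add_markers=True))  # original style: '<wh', ..., 're>'
```

With the markers removed, an n-gram such as `her` extracted from `where` coincides with the stand-alone word `her`, so morphologically related forms share subword vectors instead of being split by artificial boundary symbols.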