Abstract
Traditional word vector representation models often ignore the syntactic and morphological structure within words, which leads to low prediction accuracy. This paper proposes an improved word vector representation algorithm based on the fastText model. First, stop-word removal is applied to the training datasets to eliminate the interference of meaningless prepositions and other function words with the prediction model, reducing noise in the data. Second, the n-gram decomposition in the fastText model is constrained so that decompositions must conform to the compositional structure of English words. Finally, the word prefix and suffix boundary markers used in the fastText model are removed, reducing the interference of useless decompositions with model prediction. Experimental results show that, compared with the original fastText model, the proposed improved model achieves better accuracy on word relationship scoring, semantic similarity, and syntactic similarity tasks.
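The three modifications described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the stop-word list, function names, and n-gram range (3–6, fastText's defaults) are assumptions. The `add_markers` flag reproduces the original fastText behaviour of wrapping each word in `<` and `>`, while the default path omits the markers and keeps only purely alphabetic n-grams, mirroring the proposed constraints.

```python
# Sketch of the three preprocessing changes: stop-word removal,
# restricting n-grams to valid English letter sequences, and
# dropping the '<'/'>' word boundary markers. Illustrative only.

STOPWORDS = {"the", "of", "to", "in", "and", "a"}  # assumed small stop-word list


def remove_stopwords(tokens):
    """Step 1: drop stop-words to reduce noise in the training data."""
    return [t for t in tokens if t.lower() not in STOPWORDS]


def char_ngrams(word, n_min=3, n_max=6, add_markers=False):
    """Steps 2 and 3: fastText-style character n-grams.

    add_markers=True reproduces the original fastText behaviour
    (word wrapped in '<' and '>'); the improved variant omits the
    markers and keeps only purely alphabetic n-grams.
    """
    w = f"<{word}>" if add_markers else word
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(w) - n + 1):
            g = w[i:i + n]
            if add_markers or g.isalpha():  # keep only valid letter sequences
                grams.append(g)
    return grams


print(remove_stopwords("the structure of words".split()))  # ['structure', 'words']
print(char_ngrams("where"))                    # no boundary markers in any n-gram
print(char_ngrams("where", add_markers=True))  # original style: '<wh', ..., 're>'
```

With the markers removed, an n-gram such as `her` extracted from `where` coincides with the stand-alone word `her`, so morphologically related forms share subword vectors instead of being split by artificial boundary symbols.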