Research on word vector accuracy using variant context window

English title: Research on word vector accuracy using variant context window
Authors: HU Zheng; YANG Zhiyong
Affiliation: School of Software, Nanchang Hangkong University
Keywords: word vector; word embedding; context window; natural language processing; neural network; deep learning
Journal code: XDDJ
Journal: Modern Electronics Technique
Publication date: 2019-03-13 07:02
Publisher: Modern Electronics Technique
Year: 2019
Volume/issue: v.42; No.533
Funding: National Natural Science Foundation of China (61501218)
Language: Chinese
Article ID: XDDJ201906036
Page count: 4
Issue in year: 06
CN: 61-1224/TN
Pages: 154-156+161

Abstract

The accuracy of word vectors considerably affects the performance of natural language processing tasks. Word vectors are produced by word embedding, and every word embedding method takes a target word together with its context as training input, so the choice of context has an important influence on the resulting embedding. Using the word2vec word embedding method, this paper studies how variant context windows affect word embedding accuracy. A series of experiments was carried out with context windows of varying widths, offsets, and weights. The experimental results show that variations in the context window have only a small effect on the overall accuracy of the training results but a significant effect on individual words. It is therefore concluded that the context windows best suited to different words differ considerably, and that a single unified context window can hardly provide optimal training for all words.
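
The abstract names three window parameters (width, offset, weight) without defining them. The sketch below shows one plausible reading and is not the authors' implementation: the function `context_pairs`, its parameter names, and the rule that the window is centered at `i + offset` and excludes the target position are all assumptions made for illustration.

```python
from typing import Callable, Iterator, List, Tuple


def context_pairs(
    tokens: List[str],
    width: int = 2,
    offset: int = 0,
    weight: Callable[[int], float] = lambda distance: 1.0,
) -> Iterator[Tuple[str, str, float]]:
    """Yield (target, context, weight) triples for a shifted, weighted window.

    Assumed definition (not taken from the paper): for the target at
    position i, the window spans positions i + offset - width through
    i + offset + width, skipping position i itself and any position
    outside the sentence. The weight depends on the distance |j - i|.
    """
    for i, target in enumerate(tokens):
        center = i + offset
        for j in range(center - width, center + width + 1):
            if j == i or j < 0 or j >= len(tokens):
                continue
            yield target, tokens[j], weight(abs(j - i))


tokens = "the quick brown fox jumps".split()

# Symmetric window of width 1: one neighbor on each side of the target.
symmetric = list(context_pairs(tokens, width=1))

# Same width shifted right by one position: only following words are used.
shifted = list(context_pairs(tokens, width=1, offset=1))

# Width 2 with weights that decay with distance from the target.
weighted = list(context_pairs(tokens, width=2, weight=lambda d: 1.0 / d))
```

Under these assumptions, the three lists differ only in which neighbors each target sees and how strongly each one counts, which is the kind of variation the experiments compare.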
