An Improved word2vec Model Trained with Part-of-Speech and Word-Order Relevance Factors
  • English title: The Improved Model for word2vec Based on Part of Speech and Word Order
  • Authors: PAN Bo; YU Chong-chong; ZHANG Qing-chuan; XU Shi-xuan; CAO Shuai
  • Affiliations: Department of Computer and Information Engineering, Beijing Technology and Business University; Institute of Ethnology and Anthropology, Chinese Academy of Social Sciences
  • Keywords: word embedding; part of speech; relevance weights; word order; word2vec
  • Journal: Acta Electronica Sinica (电子学报)
  • CNKI journal code: DZXU
  • Publication date: 2018-08-15
  • Year: 2018
  • Volume/Issue: v.46; No.426 (Issue 08)
  • Funding: Humanities and Social Sciences Research and Planning Fund of the Ministry of Education (No.16YJAZH072); Major Program of the National Social Science Fund of China (No.14ZDB156); Beijing Natural Science Foundation Key Project, Class B (No.KZ201410011014)
  • Language: Chinese
  • CNKI record ID: DZXU201808024
  • Page count: 7
  • CN: 11-2087/TN
  • Pages: 186-192
Abstract
Part of speech is a basic element of natural language processing, and word order carries semantic and syntactic information; both are key pieces of information in natural language. How to combine the two effectively in a word embedding model is a current research focus. The Structured word2vec on POS model proposed in this paper joins word-order and part-of-speech information: it not only makes the model aware of word position and order, but also uses part-of-speech relevance information to capture the inherent syntactic relations between words within a context window. Structured word2vec on POS embeds words directionally according to their positions and jointly optimizes the word vectors and the part-of-speech relevance weight matrix. Experiments on word analogy and word similarity tasks demonstrate the effectiveness of the proposed method.
        Part of speech (POS) is a basic element of natural language processing (NLP), and word order carries semantic and syntactic information; both are key properties of language. Word embedding models that combine the two as learning signals are still lacking. This paper presents Structured word2vec on POS, which links word-order and POS information: it not only enables the model to sense word position and order, but also uses POS information to establish the inherent syntactic relations between words in the context window. Structured word2vec on POS directionally embeds words into the context window according to their positions and jointly optimizes the word vectors and the POS relevance weight matrix. Experiments on word analogy and word similarity tasks demonstrate the effectiveness of the method.
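The abstract describes two ingredients: position-specific ("directional") output parameters, so the model can tell context positions apart, and a learned POS-pair relevance weight that scales each context word's contribution. The following is a minimal NumPy sketch of such a scoring function under a simplified sigmoid objective; the toy vocabulary, the function names, and the single-matrix `pos_w` parameterization are illustrative assumptions, not the paper's exact formulation (which trains on large corpora and updates `pos_w` jointly with the vectors by gradient descent).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary with POS tags (hypothetical data for illustration).
vocab = ["the", "cat", "sat", "on", "mat"]
pos_tags = ["DET", "NOUN", "VERB", "ADP"]
word_pos = {"the": "DET", "cat": "NOUN", "sat": "VERB", "on": "ADP", "mat": "NOUN"}
w2i = {w: i for i, w in enumerate(vocab)}
p2i = {p: i for i, p in enumerate(pos_tags)}

dim, win = 8, 2                      # embedding size, context window radius
V, P = len(vocab), len(pos_tags)

emb = rng.normal(scale=0.1, size=(V, dim))       # input word vectors
# One output matrix per relative position (-win..-1, +1..+win): the
# "directional" embedding that lets the model sense word order.
out = rng.normal(scale=0.1, size=(2 * win, V, dim))
# POS-pair relevance weights, jointly optimized with the vectors.
pos_w = np.ones((P, P))

def score(center, context, rel_pos):
    """POS-weighted dot product against the position-specific output matrix."""
    slot = rel_pos + win if rel_pos < 0 else rel_pos + win - 1
    w = pos_w[p2i[word_pos[center]], p2i[word_pos[context]]]
    return w * emb[w2i[center]] @ out[slot, w2i[context]]

def sentence_loss(sent):
    """Negative log-sigmoid loss over all (center, context) pairs in a sentence."""
    loss = 0.0
    for i, c in enumerate(sent):
        for j in range(max(0, i - win), min(len(sent), i + win + 1)):
            if j != i:
                loss += -np.log(1.0 / (1.0 + np.exp(-score(c, sent[j], j - i))))
    return loss

print(sentence_loss(["the", "cat", "sat", "on", "mat"]))
```

Because each relative position owns its own output matrix, swapping two context words changes the score, which a standard CBOW/skip-gram bag-of-words model cannot detect; the `pos_w` entry then lets, say, DET-NOUN pairs weigh more than DET-ADP pairs once trained.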
References
[1] Johnson R, Zhang T. Effective use of word order for text categorization with convolutional neural networks[J]. arXiv preprint arXiv:1412.1058, 2014.
    [2] ZHAN C D, LING Z H, DAI L R. Learning word embeddings for paraphrase scoring in knowledge base based question answering[J]. Pattern Recognition and Artificial Intelligence, 2016, 29(9): 825-831.
    [3] YIN C Y, et al. Optimization of Chinese word segmentation in named entity recognition and word alignment[J]. Acta Electronica Sinica, 2015, 43(8): 1481-1487. (in Chinese)
    [4] YANG S C, et al. Advances in question classification for open-domain question answering[J]. Acta Electronica Sinica, 2015, 43(8): 1627-1636. (in Chinese)
    [5] Mikolov T, Yih S W, Zweig G. Linguistic regularities in continuous space word representations[A]. Conference of the North American Chapter of the Association for Computational Linguistics[C]. Atlanta, Georgia, USA: Association for Computational Linguistics, 2013. 746-751.
    [6] Bengio Y. Learning deep architectures for AI[J]. Foundations and Trends in Machine Learning, 2009, 2(1): 1-127.
    [7] Chang K W, Yih W, Meek C. Multi-relational latent semantic analysis[A]. EMNLP[C]. Seattle, Washington, USA: Association for Computational Linguistics, 2013. 1602.
    [8] Lund K, Burgess C. Hyperspace analogue to language (HAL): A general model of semantic representation[J]. Brain and Cognition, 1996, 30(3): 5-5.
    [9] Lebret R, Collobert R. Word embeddings through Hellinger PCA[A]. Conference of the European Chapter of the Association for Computational Linguistics (EACL)[C]. Gothenburg, Sweden: Association for Computational Linguistics, 2014. 482-490.
    [10] Bengio Y, Schwenk H, Senécal J S, et al. Neural Probabilistic Language Models[M]. Berlin Heidelberg: Springer, 2006. 137-186.
    [11] Zhang X, Gu N, Ye H. Multi-GPU based recurrent neural network language model training[A]. International Conference of Young Computer Scientists, Engineers and Educators[C]. Singapore: Springer, 2016. 484-493.
    [12] Mikolov T, Karafiát M, Burget L, et al. Recurrent neural network based language model[A]. INTERSPEECH 2010, Conference of the International Speech Communication Association[C]. Makuhari, Chiba, Japan, 2010. 1045-1048.
    [13] Mikolov T, Chen K, Corrado G, et al. Efficient estimation of word representations in vector space[J]. Computer Science, 2013, 5(4): 243-254.
    [14] Mnih A, Kavukcuoglu K. Learning word embeddings efficiently with noise-contrastive estimation[A]. Advances in Neural Information Processing Systems[C]. Lake Tahoe, Nevada, United States, 2013. 2265-2273.
    [15] Levy O, Goldberg Y, Ramat-Gan I. Linguistic regularities in sparse and explicit word representations[A]. CoNLL 2014, Association for Computational Linguistics[C]. Baltimore, Maryland, USA, 2014. 171-180.
    [16] Pennington J, Socher R, Manning C D. GloVe: Global vectors for word representation[A]. Association for Computational Linguistics[C]. Doha, Qatar: Association for Computational Linguistics, 2014. 1532-1543.
    [17] Ling W, Dyer C, Black A, et al. Two/too simple adaptations of word2vec for syntax problems[A]. The 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies[C]. Denver, Colorado, USA: Association for Computational Linguistics, 2015. 1299-1304.
    [18] Liu Q, Ling Z H, Jiang H, et al. Part-of-speech relevance weights for learning word embeddings[J]. arXiv preprint, 2016, 9(7): 134-139.
    [19] Marcus M P, Marcinkiewicz M A, Santorini B. Building a large annotated corpus of English: The Penn Treebank[J]. Computational Linguistics, 1993, 19(2): 313-330.
    [20] Finkelstein L, Gabrilovich E, Matias Y, et al. Placing search in context: The concept revisited[A]. Proceedings of the 10th International Conference on World Wide Web[C]. New York: ACM, 2001. 406-414.
