Refining Word Vector Representation with Reliable Lexical Semantic Constraints
(基于可靠词汇语义约束的词语向量表达修正研究)
  • Authors: LIANG Yongshi; HUANG Peijie; HUANG Peisong; DU Zefeng (College of Mathematics and Informatics, South China Agricultural University)
  • Keywords: word vector representation refinement; reliable lexical semantic constraints; transmission mechanism of core words
  • Chinese keywords: 词语向量表达修正; 可靠词汇语义约束; 核心词约束传递
  • Journal: Journal of Chinese Information Processing (中文信息学报); CNKI code: MESS
  • Publication date: 2019-01-15
  • Year: 2019; Volume: 33; Issue: 01
  • Pages: 61-72 (12 pages)
  • CN: 11-2325/N
  • Article ID: MESS201901010
  • Funding: National Natural Science Foundation of China (Grant No. 71472068)
  • Language: Chinese
Abstract
Word vector representation is the basis of many downstream natural language processing (NLP) applications. Previous studies have used the lexical semantic constraints provided by various lexical taxonomies to refine word vectors trained on large corpora, improving their semantic expressiveness. However, manually compiled or semi-automatically constructed lexical taxonomies commonly provide constraints of unstable reliability. Based on the interaction between lexical taxonomies and word vectors, and between heterogeneous lexical taxonomies, this paper studies how to extract reliable lexical semantic constraints suited to word vector refinement. Specifically, for each synonym class provided by a lexical taxonomy, the reliability of the words in the class is computed and assessed from their word vectors. On this basis, a mechanism that removes unreliable semantic constraints avoids erroneously refining words whose class assignment may be inaccurate; cross-confirmation between different lexical taxonomies recovers part of the constraints removed by mistake; and a core-word constraint transmission mechanism keeps words with unreliable original vectors from harming the refinement. The method is evaluated on the PKU 500 dataset from the NLPCC-ICCPOL 2016 shared task on Chinese word similarity measurement. Applying the reliable lexical semantic constraints extracted by the proposed method to two state-of-the-art lightweight post-processing refinement methods improves the word similarity performance of both refined vector sets, reaching a Spearman rank correlation coefficient of 0.649 7, a 25.4% improvement over the first-place result of the shared task.
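To make the pipeline described in the abstract concrete, below is a minimal Python sketch; it is an illustration under assumptions, not the paper's implementation. It assumes cosine similarity of the original vectors as the reliability measure with a hypothetical threshold, uses a generic retrofitting-style update in place of the two specific post-processing methods evaluated in the paper, and omits the core-word constraint transmission mechanism. The function names (filter_reliable_pairs, retrofit, spearman_on_dataset) and all parameters are illustrative.

from itertools import combinations
import numpy as np
from scipy.stats import spearmanr


def cosine(u, v):
    # Cosine similarity between two dense vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))


def filter_reliable_pairs(word_classes, vectors, threshold=0.3, backup_classes=None):
    # Turn synonym classes into pairwise constraints, keeping only pairs whose
    # original vectors already agree to some degree (reliable constraints).
    # A pair rejected by the threshold is recovered if a second, heterogeneous
    # taxonomy (backup_classes) also places the two words in one class.
    backup_pairs = set()
    if backup_classes:
        for cls in backup_classes:
            backup_pairs.update(frozenset(p) for p in combinations(cls, 2))

    reliable = set()
    for cls in word_classes:
        for w1, w2 in combinations(sorted(cls), 2):
            if w1 not in vectors or w2 not in vectors:
                continue
            if cosine(vectors[w1], vectors[w2]) >= threshold:
                reliable.add((w1, w2))          # kept as a reliable constraint
            elif frozenset((w1, w2)) in backup_pairs:
                reliable.add((w1, w2))          # recovered by cross-taxonomy confirmation
    return reliable


def retrofit(vectors, pairs, iterations=10, alpha=1.0, beta=1.0):
    # Generic retrofitting-style post-processing: pull every word towards its
    # reliable synonyms while staying close to its original vector.
    new_vecs = {w: v.copy() for w, v in vectors.items()}
    neighbours = {}
    for w1, w2 in pairs:
        neighbours.setdefault(w1, []).append(w2)
        neighbours.setdefault(w2, []).append(w1)
    for _ in range(iterations):
        for w, nbrs in neighbours.items():
            nbr_sum = np.sum([new_vecs[n] for n in nbrs], axis=0)
            new_vecs[w] = (alpha * vectors[w] + beta * nbr_sum) / (alpha + beta * len(nbrs))
    return new_vecs


def spearman_on_dataset(vectors, dataset):
    # dataset: iterable of (word1, word2, human_score); returns Spearman's rho
    # between human scores and cosine similarities of the (refined) vectors.
    gold, pred = [], []
    for w1, w2, score in dataset:
        if w1 in vectors and w2 in vectors:
            gold.append(score)
            pred.append(cosine(vectors[w1], vectors[w2]))
    return spearmanr(gold, pred).correlation

With synonym classes drawn from one lexical taxonomy and a second taxonomy supplying backup_classes for cross-confirmation, a call such as spearman_on_dataset(retrofit(vectors, filter_reliable_pairs(classes, vectors, backup_classes=backup)), pku500), where pku500 is a list of human-scored word pairs, mirrors the evaluation protocol sketched in the abstract.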
