An Embedded Representation for "Tongyici Cilin" and Its Evaluation on Tasks (《同义词词林》的嵌入表示与应用评估)
  • Authors: DUAN Yuguang (段宇光); LIU Yang (刘扬); YU Shiwen (俞士汶)
  • Affiliations: Key Laboratory of Computational Linguistics (Ministry of Education), Peking University; Yuanpei College, Peking University; Institute of Computational Linguistics, Peking University
  • Keywords: "Tongyici Cilin" (《同义词词林》); embedded representation; semantic compositionality; analogical reasoning; similarity
  • Journal: Journal of Xiamen University (Natural Science) (厦门大学学报(自然科学版))
  • Journal code: XDZK
  • Publication date: 2018-10-19
  • Year: 2018
  • Volume/Issue: v.57; No.267 (Issue 06)
  • Funding: National Basic Research Program of China (973 Program) (2014CB340504); Major Project of the National Social Science Fund of China (12&ZD119); National Social Science Fund of China (16BYY137)
  • Language: Chinese
  • Record ID: XDZK201806014
  • Pages: 133-141 (9 pages)
  • CN: 35-1070/N
Abstract
In natural language processing (NLP), embedded representations are an important means of encoding linguistic knowledge, yet so far they have mostly been learned from large-scale corpora, with little attention to the rational knowledge stored in knowledge bases. Taking "Tongyici Cilin", a well-known Chinese thesaurus, as an example, this paper proposes a pseudo-sentence construction method for training embedded representations from a knowledge base and tests its effectiveness on several NLP tasks. Based on the hierarchical structure reflected by the sense encodings in "Tongyici Cilin", the encodings are expanded into multiple kinds of pseudo-sentences, from which different pseudo-corpora are generated; the word2vec model is then trained on these pseudo-corpora to obtain sememe and word vectors, yielding the CiLin2Vec resource. CiLin2Vec is applied to the tasks of semantic compositionality, analogical reasoning, and word-similarity computation. On semantic compositionality and analogical reasoning it reaches an accuracy of over 90%, surpassing previous results trained on corpora. This demonstrates that the method can effectively implant the rational knowledge of a knowledge base into embedded representations, and shows the strong application potential of the CiLin2Vec resource.
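The core idea described in the abstract, expanding the hierarchical sense codes of "Tongyici Cilin" into pseudo-sentences and training word2vec on them so that sense codes (sememe-like units) and words share one vector space, can be illustrated with a minimal Python sketch. The entry format, the single template, the sample data, and the gensim hyperparameters below are illustrative assumptions; the paper's actual templates and training settings are not given in this record.

    from gensim.models import Word2Vec

    # Hypothetical Cilin-style entries: a hierarchical sense code followed by
    # the words of its synonym group (codes and words here are samples only).
    entries = [
        ("Aa01A02=", ["我", "咱", "自己"]),
        ("Aa01A03=", ["你", "您"]),
        ("Ba01A01=", ["东西", "物", "物品"]),
    ]

    def code_prefixes(code):
        # Split a code such as "Aa01A02" into its five hierarchical levels:
        # "A", "Aa", "Aa01", "Aa01A", "Aa01A02".
        cuts = (1, 2, 4, 5, 7)
        return [code[:c] for c in cuts if c <= len(code)]

    # One pseudo-sentence per (code, word) pair, so every word co-occurs with
    # each level of its sense hierarchy; this is one plausible template only.
    pseudo_corpus = [code_prefixes(code.rstrip("=#@")) + [w]
                     for code, words in entries for w in words]

    model = Word2Vec(pseudo_corpus, vector_size=50, window=6,
                     min_count=1, sg=1, epochs=200, seed=1)

    print(model.wv.similarity("我", "咱"))                      # word-word similarity
    print(model.wv.most_similar(positive=["Aa01A02"], topn=3))  # neighbours of a sense code

Because each word co-occurs with every level of its sense code, the learned vectors place codes and words in a single space, so the similarity, compositionality, and analogy queries evaluated in the paper reduce to vector arithmetic (e.g. most_similar with positive and negative terms) over that space.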
