Word Similarity Computing by Integrating Vector-based Models with Multiple Lexical Taxonomies (向量模型和多源词汇分类体系相结合的词语相似性计算)
  • English title: Word Similarity Computing by Integrating Vector-based Models with Multiple Lexical Taxonomies
  • Authors: LIANG Yongshi (梁泳诗); HUANG Peijie (黄沛杰); CEN Hongjie (岑洪杰); TANG Jiecong (唐杰聪); WANG Jundong (王俊东)
  • Keywords: word similarity; vector-based model; lexical taxonomy; combinational method; multi-source fusion
  • Journal code: MESS
  • Journal: Journal of Chinese Information Processing (中文信息学报)
  • Affiliation: College of Mathematics and Informatics, South China Agricultural University
  • Publication date: 2018-04-15
  • Publisher: Journal of Chinese Information Processing
  • Year: 2018
  • Volume: 32
  • Funding: National Natural Science Foundation of China (71472068)
  • Language: Chinese
  • Record ID: MESS201804004
  • Page count: 9
  • Issue: 04
  • CN: 11-2325/N
  • Pages: 35-43
Abstract
Existing methods for computing the semantic similarity of words fall mainly into two classes: those based on vector models and those based on lexical taxonomies. Both have shortcomings. The contextual information that vector models draw from textual co-occurrence is not equivalent to semantics in the true sense, while lexical taxonomies are costly to build and remain incomplete to some degree. This paper proposes a word similarity method that combines a vector model with multi-source lexical taxonomies: the vector representation of a word is computed from the synonym relations in multiple lexical taxonomies together with the word vectors produced by the vector model. The paper also explores how to select and fuse the knowledge supplied by different types of lexical taxonomy, which compensates for the weaknesses of relying on word vectors alone or on a single taxonomy. The method is evaluated on PKU 500, the dataset of the NLPCC-ICCPOL 2016 shared task on Chinese word similarity measurement, where it achieves a Spearman rank correlation coefficient of 0.637, a 23% improvement over the first-place result in that shared task.
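The abstract does not give the fusion formula, so the following Python sketch is only an illustration of the general idea, not the authors' method. It assumes a simple blending step (each word vector is averaged with the mean vector of its listed synonyms), scores word pairs by cosine similarity, and evaluates against human ratings with Spearman rank correlation, the metric reported above. The function names, the `alpha` weight, the toy vectors, and the toy gold scores are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def blend_with_synonyms(vectors, synonyms, alpha=0.5):
    """Assumed fusion step: pull each word vector toward the mean of its
    synonyms' vectors. alpha is the weight kept on the original vector."""
    blended = {}
    for word, vec in vectors.items():
        neighbours = [vectors[s] for s in synonyms.get(word, ()) if s in vectors]
        if not neighbours:
            blended[word] = list(vec)  # no taxonomy knowledge: keep as-is
            continue
        mean = [sum(col) / len(neighbours) for col in zip(*neighbours)]
        blended[word] = [alpha * a + (1 - alpha) * m for a, m in zip(vec, mean)]
    return blended


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy) if sx and sy else 0.0


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    vectors = {"car": [1.0, 0.0], "automobile": [0.0, 1.0], "banana": [-1.0, 0.0]}
    synonyms = {"car": ["automobile"], "automobile": ["car"]}
    pairs = [("car", "automobile"), ("car", "banana"), ("automobile", "banana")]
    gold = [9.0, 1.5, 1.0]  # hypothetical human similarity ratings

    blended = blend_with_synonyms(vectors, synonyms)
    before = [cosine(vectors[a], vectors[b]) for a, b in pairs]
    after = [cosine(blended[a], blended[b]) for a, b in pairs]
    print("Spearman before blending:", spearman(before, gold))
    print("Spearman after blending:", spearman(after, gold))
```

On the toy data the synonym pair starts out orthogonal in the raw vectors and becomes nearly identical after blending, which is the kind of correction taxonomy knowledge is meant to supply. A real evaluation would use trained word vectors, actual synonym resources, and the PKU 500 ratings.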
