Construction of Tibetan Words Embedding Similarity and Relevance Evaluation Set

Original title (Chinese): 藏文词向量相似度和相关性评测集构建
Authors: CAI Zhijie (才智杰); SUN Maosong (孙茂松); CAI Rangzhuoma (才让卓玛)
Keywords: natural language processing; Tibetan; word embedding; evaluation set
Journal: Journal of Chinese Information Processing (journal code: MESS)
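Affiliations: College of Computer Science and Technology, Qinghai Normal University; Qinghai Provincial Key Laboratory of Tibetan Information Processing and Machine Translation; Key Laboratory of Tibetan Information Processing, Ministry of Education; Department of Computer Science and Technology, Tsinghua University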
Publication date: 2019-07-15
Publisher: Journal of Chinese Information Processing (中文信息学报)
Year: 2019
Volume: 33
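Funding: National Natural Science Foundation of China (61866032, 61163018); National Social Science Foundation of China (13BYY141, 16BYY167); Ministry of Education "Chunhui Program" cooperative research projects (Z2012093, Z2016077); Qinghai Province Basic Research Program (2017-ZJ-767, 2019-SF-129); "Program for Changjiang Scholars and Innovative Research Team in University" innovation team grant (IRT1068); Qinghai Provincial Key Laboratory Program (2013-Z-Y17, 2014-Z-Y32, 2015-Z-Y03); Key Laboratory of Tibetan Information Processing and Machine Translation Program (2013-Y-17)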
Language: Chinese
Page (record identifier): MESS201907011
Page count: 8
Issue: 07
CN: 11-2325/N
Pages: 86-92+105

Abstract

Evaluating word embeddings is fundamental to word embedding research and takes two forms: intrinsic evaluation and extrinsic evaluation. Extrinsic evaluation applies the learned embeddings to a specific downstream task and measures performance there, which is the ultimate goal of word embedding research. Intrinsic evaluation assesses an embedding model against an evaluation set that records the semantic similarity or relevance between words, and is a commonly used way of evaluating word embeddings. After analyzing how word embedding evaluation sets have been constructed for English and Chinese, and taking the characteristics of Tibetan into account, this paper studies methods for constructing Tibetan word embedding evaluation sets. It builds two evaluation sets, TWordSim215 and TWordRel215, for evaluating the similarity and relevance of Tibetan word embeddings, and analyzes their effectiveness.
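
The intrinsic evaluation described in the abstract can be illustrated with a minimal sketch. It assumes the common WordSim-style protocol: score each word pair by the cosine similarity of its vectors, then report the Spearman rank correlation with the human scores. This record does not state the metric or file format used for TWordSim215 and TWordRel215, so the metric choice, the function names, the placeholder words `w1`..`w5`, and the toy scores below are all assumptions for illustration, not data from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def evaluate(embeddings, scored_pairs):
    """Correlate model similarities with human scores.

    Pairs containing an out-of-vocabulary word are skipped.
    Returns (spearman_rho, number_of_pairs_used).
    """
    model_scores, human_scores = [], []
    for word_a, word_b, human in scored_pairs:
        if word_a in embeddings and word_b in embeddings:
            model_scores.append(cosine(embeddings[word_a], embeddings[word_b]))
            human_scores.append(human)
    return spearman(model_scores, human_scores), len(model_scores)


# Hypothetical toy data: placeholder words and made-up human scores.
toy_embeddings = {
    "w1": [1.0, 0.0],
    "w2": [0.9, 0.1],
    "w3": [0.0, 1.0],
    "w4": [0.5, 0.5],
}
toy_pairs = [
    ("w1", "w2", 9.0),
    ("w1", "w4", 5.5),
    ("w1", "w3", 1.0),
    ("w2", "w5", 4.0),  # "w5" has no vector, so this pair is skipped
]

rho, used = evaluate(toy_embeddings, toy_pairs)
print(f"Spearman rho = {rho:.3f} over {used} pairs")
```

A higher correlation means the model ranks word pairs more like the human annotators do. Reporting the number of pairs actually used matters in practice, because out-of-vocabulary words can silently shrink the evaluation set.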

References

[1] D Lin. Automatic retrieval and clustering of similar words[C]//Proceedings of ACL/COLING, 1998: 768-774.
[2] J R Curran, M Moens. Scaling context space[C]//Proceedings of ACL, 2002: 231-238.
[3] G Dinu, M Lapata. Measuring distributional similarity in context[C]//Proceedings of EMNLP, 2010: 1162-1172.
[4] A Budanitsky, G Hirst. Evaluating WordNet-based measures of lexical semantic relatedness[J]. Computational Linguistics, 2006, 32(1): 13-47.
[5] A Fujii, T Hasegawa, T Tokunaga, et al. Integration of hand-crafted and statistical resources in measuring word similarity[C]//Proceedings of Workshop of Automatic Information Extraction and Building of Lexical Semantic Resources for NLP Applications, 1997: 45-51.
[6] L Finkelstein, E Gabrilovich, Y Matias, et al. Placing search in context: The concept revisited[J]. ACM Transactions on Information Systems, 2002, 20(1): 116-131.
[7] Word Vector Evaluation[EB/OL]. http://wordvectors.org/, 2018-8-30.
[8] M Faruqui, C Dyer. Community evaluation and exchange of word vectors at wordvectors.org[C]//Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 2014.
[9] P Jin, Y Wu. SemEval-2012 task 4: Evaluating Chinese word similarity[C]//Proceedings of First Joint Conference on Lexical and Computational Semantics (SEM), 2012: 374-377.
[10] X Chen, L Xu, Z Liu, et al. Joint learning of character and word embeddings[C]//Proceedings of International Joint Conference on Artificial Intelligence (IJCAI'15), 2015.
[11] CAI Zhijie, CAI Rangzhuoma. Vector model of Tibetan characters and analysis of component features (藏文字符的向量模型及构件特征分析)[J]. Journal of Chinese Information Processing, 2016, 30(2): 202-206.
[12] T Mikolov, K Chen, G Corrado, J Dean. Efficient estimation of word representations in vector space[C]//Proceedings of 2013 Workshop at International Conference on Learning Representations, 2013.
[13] LAI Siwei. Research on neural-network-based semantic vector representations of words and documents (基于神经网络的词和文档语义向量表示方法研究)[D]. Beijing: doctoral dissertation, University of Chinese Academy of Sciences, 2016.
[14] J Pennington, R Socher, C Manning. GloVe: Global vectors for word representation[C]//Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2014.
