Abstract
Evaluating word embeddings is fundamental to word embedding research and takes two forms: intrinsic evaluation and extrinsic evaluation. Extrinsic evaluation applies the learned embeddings to a specific downstream task and measures performance there, which is the ultimate goal of word embedding research. Intrinsic evaluation, the commonly used approach, builds an evaluation set of word pairs that captures semantic similarity or relatedness and uses it to assess an embedding model. After examining how such evaluation sets were constructed for English and Chinese, this paper studies how to construct one for Tibetan in light of the characteristics of the language. Two evaluation sets, TWordSim215 and TWordRel215, are constructed for evaluating the similarity and relatedness of Tibetan word embeddings, and their effectiveness is analyzed.
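As an illustration of how an intrinsic evaluation set of this kind is typically used, the sketch below scores a toy embedding table against a few human-rated word pairs. It rests on assumptions that the abstract does not state: cosine similarity as the model score and Spearman rank correlation as the agreement measure, which is common practice for WordSim-353-style sets. The word labels, vectors, ratings, and function names are all hypothetical placeholders, not data from TWordSim215 or TWordRel215.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ranks(values):
    """1-based ranks in ascending order; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def evaluate(embeddings, eval_set):
    """Return (Spearman rho, number of pairs covered by the vocabulary)."""
    model_scores, human_scores = [], []
    for word1, word2, human_score in eval_set:
        # Pairs with an out-of-vocabulary word are skipped, not scored as 0.
        if word1 in embeddings and word2 in embeddings:
            model_scores.append(cosine(embeddings[word1], embeddings[word2]))
            human_scores.append(human_score)
    return spearman(model_scores, human_scores), len(human_scores)


# Hypothetical toy data: placeholder labels stand in for Tibetan words.
embeddings = {
    "w1": [1.0, 0.0],
    "w2": [0.9, 0.1],
    "w3": [0.0, 1.0],
    "w4": [0.5, 0.5],
}
eval_set = [
    ("w1", "w2", 9.0),
    ("w1", "w4", 5.0),
    ("w1", "w3", 1.0),
    ("w1", "w5", 7.0),  # "w5" has no vector, so this pair is skipped
]

rho, covered = evaluate(embeddings, eval_set)
print(f"Spearman rho = {rho:.3f} over {covered} of {len(eval_set)} pairs")
```

In this toy case the model ranks the three covered pairs in the same order as the human ratings, so the correlation is at its maximum. Reporting the number of covered pairs alongside the correlation matters for a language with limited corpus resources, where out-of-vocabulary words can shrink the effective evaluation set.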