Method of word vector based on centralization similarity matrix
  • Authors: Xu Fan; Wang Peiyan; Cai Dongfeng
  • Affiliation: Human-Computer Intelligence Research Center, Shenyang Aerospace University
  • Keywords: word vector; centralization; similarity matrix
  • Journal: Application Research of Computers (计算机应用研究; JSYJ)
  • Online publication date: 2018-02-08
  • Year: 2019
  • Volume/Issue: Vol. 36, No. 328 (Issue 02)
  • Pages: 97-100, 120 (5 pages)
  • CN: 51-1196/TP
  • Funding: Liaoning Provincial Natural Science Foundation Key Project (20170540705); National Natural Science Foundation of China (61403262)
  • Language: Chinese
  • Record ID: JSYJ201902022
Abstract
This paper studies word vector methods based on matrix factorization. It finds a linear correlation between the quality of the similarity matrix before dimensionality reduction and the quality of the resulting word vectors, and accordingly proposes a word vector method based on a centered (centralized) similarity matrix. Centering relatively strengthens the similarity between similar words and relatively weakens it between dissimilar or weakly similar words. Word similarity experiments on the WS-353 and RW datasets verify the effectiveness of the proposed method, with word vector quality improved by up to 0.2896 and 0.1801 on the two datasets, respectively. Centering improves the quality of the similarity matrix before dimensionality reduction, and thereby the quality of the word vectors.
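For illustration, here is a minimal Python sketch of the pipeline the abstract describes: build a word-by-word similarity matrix, center it, then factorize it to obtain low-dimensional word vectors. The double-centering operator H = I - (1/n)·11ᵀ (as used in kernel PCA), the correlation-based toy similarity matrix, and all function names and dimensions below are assumptions of this sketch; the paper's exact matrix construction and factorization may differ.

```python
import numpy as np

def center_similarity_matrix(S):
    """Double-center a symmetric similarity matrix (kernel-PCA style):
    C = H @ S @ H with H = I - ones/n. Subtracting row/column means makes
    above-average similarities relatively stronger and below-average ones
    relatively weaker -- the effect described in the abstract."""
    n = S.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ S @ H

def word_vectors_from_similarity(S, dim=50):
    """Factorize the centered similarity matrix with truncated SVD and keep
    the top-`dim` components as word vectors (a common count-based recipe;
    an illustrative stand-in for the paper's factorization step)."""
    C = center_similarity_matrix(S)
    U, sigma, _ = np.linalg.svd(C)
    return U[:, :dim] * np.sqrt(sigma[:dim])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    counts = rng.random((20, 100))   # toy co-occurrence statistics: 20 words
    S = np.corrcoef(counts)          # stand-in similarity matrix
    vecs = word_vectors_from_similarity(S, dim=5)
    a, b = vecs[0], vecs[1]
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    print(f"cosine(word0, word1) = {cos:.4f}")
```

On real data, word vector quality would be scored as in the paper's experiments: the Spearman correlation between cosine similarities of the learned vectors and human similarity judgments on WS-353 and RW.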
