A Sentence Similarity Calculation Method Based on Character Vectors and LSTM

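English title (as published): A Method of Sentence Similarity Calculation Based on Word Vector and LSTM
Authors: 何颖刚 (He Yinggang); 王宇 (Wang Yu)
Authors (English): He Yinggang; Wang Yu; Chengyi College, Jimei University
Keywords: sentence similarity; character vector; Word2Vec; LSTM neural network
Keywords (English, as published): semantic similarity; character vector; Word2Vec; LSTM neural network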
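Chinese journal title: CJDL
English journal title: Journal of Yangtze University (Natural Science Edition)
Institution: Chengyi College, Jimei University
Publication date: 2019-01-25
Publisher: Journal of Yangtze University (Natural Science Edition)
Year: 2019
Issue: v.16; No.245
Funding: Fujian Province Education and Research Project for Young and Middle-aged Teachers (JAS151390); Fujian Province Higher Education Innovation and Entrepreneurship Education Reform Project (C16052); Youth Research Fund Project of Chengyi College, Jimei University (CK17067)
Language: Chinese
Page: CJDL201901017
Number of pages: 8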
CN：01
ISSN：42-1741/N
Classification code: 9+98-104

Abstract

Sentence similarity measurement is widely used in natural language processing. Existing sentence similarity methods do not fully capture the semantic and structural feature information of a sentence. To address this problem, a sentence similarity calculation method based on character vectors and an LSTM (long short-term memory) network is proposed. First, a Word2Vec model is trained on a Chinese Wikipedia corpus to obtain a dictionary of Chinese character vectors. Then, each sentence is mapped to a sentence vector using the character vector dictionary and fed into the LSTM network to obtain the feature vector of the sentence. Finally, a similarity algorithm computes the similarity between the feature vectors of the two sentences. Experimental results on two datasets show that the method improves the accuracy of sentence similarity calculation and outperforms both traditional sentence similarity methods and word-vector-based similarity methods.
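The three steps in the abstract (character vector lookup, LSTM encoding, similarity between feature vectors) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the character vectors are random stand-ins for a Word2Vec-trained dictionary, the LSTM weights are untrained, the dimensions `DIM` and `HIDDEN` are arbitrary, and cosine similarity is assumed because the abstract does not name the similarity algorithm. All function names are hypothetical.

```python
import math
import random

DIM = 8      # character-vector size (assumed, not from the paper)
HIDDEN = 6   # LSTM hidden-state size (assumed, not from the paper)


def build_char_vectors(corpus, dim=DIM, seed=0):
    """Stand-in for a Word2Vec-trained dictionary: one random vector per character."""
    rng = random.Random(seed)
    chars = sorted({ch for sentence in corpus for ch in sentence})
    return {ch: [rng.uniform(-1, 1) for _ in range(dim)] for ch in chars}


def sentence_to_matrix(sentence, char_vectors):
    """Map a sentence to its sequence of character vectors, skipping unknown characters."""
    return [char_vectors[ch] for ch in sentence if ch in char_vectors]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_lstm(input_dim=DIM, hidden=HIDDEN, seed=1):
    """Random (untrained) weights for the input, forget, output and candidate gates."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    return {gate: (mat(hidden, input_dim + hidden), [0.0] * hidden) for gate in "ifog"}


def affine(W, b, v):
    """Compute W @ v + b with plain lists."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def lstm_encode(seq, params, hidden=HIDDEN):
    """Run a single-layer LSTM over the sequence and return the final hidden state."""
    h = [0.0] * hidden
    c = [0.0] * hidden
    for x in seq:
        z = x + h  # concatenate the current input with the previous hidden state
        i = [sigmoid(a) for a in affine(*params["i"], z)]
        f = [sigmoid(a) for a in affine(*params["f"], z)]
        o = [sigmoid(a) for a in affine(*params["o"], z)]
        g = [math.tanh(a) for a in affine(*params["g"], z)]
        c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
        h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
    return h


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)


def sentence_similarity(s1, s2, char_vectors, params):
    """Encode both sentences with the same LSTM and compare their feature vectors."""
    h1 = lstm_encode(sentence_to_matrix(s1, char_vectors), params)
    h2 = lstm_encode(sentence_to_matrix(s2, char_vectors), params)
    return cosine(h1, h2)


if __name__ == "__main__":
    corpus = ["今天天气很好", "今天天气不错"]
    vectors = build_char_vectors(corpus)
    lstm = init_lstm()
    print(sentence_similarity(corpus[0], corpus[1], vectors, lstm))
```

Because the weights here are random, the printed score carries no semantic meaning; in a real pipeline the character vectors would come from a trained Word2Vec model and the LSTM would be trained on labeled sentence pairs. Working at the character level avoids the need for a Chinese word segmenter, which is the design choice the title emphasizes.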
