Jaccard Text Similarity Algorithm Based on Word Embedding

Original title (Chinese): 基于词向量的Jaccard相似度算法
Authors: TIAN Xing (田星); ZHENG Jin (郑瑾); ZHANG Zu-ping (张祖平)
Affiliation: School of Information Science and Engineering, Central South University (中南大学信息科学与工程学院)
Keywords: word embedding; Jaccard algorithm; sentence similarity (the English keyword list in the record gives "text similarity")
Journal: Computer Science (计算机科学), journal code JSJA
Publication date: 2018-07-15
Publisher: Computer Science (计算机科学)
Year: 2018
Volume: 45
Issue: 07
Funding: Supported by the National Natural Science Foundation of China (61379109)
Language: Chinese
Record ID: JSJA201807032
Pages: 192-195
Page count: 4
CN: 50-1075/TP

Abstract

This paper studies the traditional Jaccard algorithm and proposes an improved Jaccard sentence similarity algorithm based on word embeddings. The traditional Jaccard algorithm uses the literal tokens of a sentence as features, which limits its ability to capture similarity at the semantic level. With the rise of deep learning, and in particular the introduction of word embeddings, the representation of words in computers has advanced substantially. The proposed algorithm first trains a model that maps each word to a high-dimensional vector at the semantic level, then computes the similarity between word vectors. Word pairs whose similarity exceeds a threshold α are treated as the co-occurring part (the intersection), and the sentence similarity is computed from it. Experiments show that, compared with the traditional Jaccard algorithm, the proposed algorithm noticeably improves accuracy on short-text similarity calculation.
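The procedure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state which similarity measure is used, how word pairs are matched, or how the union is counted. The sketch assumes cosine similarity, greedy one-to-one matching of word pairs above α, and a union of size |A| + |B| - |matches|. The function names and the toy three-dimensional vectors are hypothetical; real use would load vectors from a trained word-embedding model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def jaccard(words_a, words_b):
    """Traditional Jaccard on literal tokens: |A & B| / |A | B|."""
    a, b = set(words_a), set(words_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def embedding_jaccard(words_a, words_b, vectors, alpha=0.8):
    """Jaccard variant where word pairs with similarity above alpha
    count as the co-occurring part (assumed matching scheme)."""
    a, b = sorted(set(words_a)), sorted(set(words_b))
    if not a and not b:
        return 0.0

    # Score every cross-sentence word pair; keep those above the threshold.
    candidates = []
    for wa in a:
        for wb in b:
            if wa == wb:
                sim = 1.0  # identical words always co-occur
            elif wa in vectors and wb in vectors:
                sim = cosine(vectors[wa], vectors[wb])
            else:
                sim = 0.0  # out-of-vocabulary words match only literally
            if wa == wb or sim > alpha:
                candidates.append((sim, wa, wb))

    # Greedy one-to-one matching, most similar pairs first (assumption).
    candidates.sort(reverse=True)
    used_a, used_b = set(), set()
    matched = 0
    for _, wa, wb in candidates:
        if wa not in used_a and wb not in used_b:
            used_a.add(wa)
            used_b.add(wb)
            matched += 1

    # Each matched pair is counted once in the union.
    union = len(a) + len(b) - matched
    return matched / union


# Hypothetical toy vectors, for illustration only.
vectors = {
    "car": [0.9, 0.1, 0.0],
    "automobile": [0.85, 0.15, 0.05],
    "fast": [0.1, 0.9, 0.1],
    "quick": [0.15, 0.85, 0.1],
    "banana": [0.0, 0.1, 0.95],
}

s1 = ["the", "car", "is", "fast"]
s2 = ["the", "automobile", "is", "quick"]

print(jaccard(s1, s2))
print(embedding_jaccard(s1, s2, vectors, alpha=0.8))
```

On these toy inputs, the literal Jaccard score is 1/3 because only "the" and "is" overlap. The embedding variant also pairs "car" with "automobile" and "fast" with "quick", giving a score of 1.0. The result depends strongly on the choice of α and on the quality of the embeddings.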

