A survey on research progress of text similarity calculation

Original title: 文本相似度计算研究进展综述
Authors: WANG Hanru (王寒茹); ZHANG Yangsen (张仰森)
Keywords: distance formula; similarity calculation method; word similarity; sentence similarity; text similarity
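Journal code: BJGY
Journal: Journal of Beijing Information Science & Technology University
Affiliation: Computer School, Beijing Information Science & Technology University
Publication date: 2019-02-15
Publisher: Journal of Beijing Information Science & Technology University (Natural Science Edition) (北京信息科技大学学报(自然科学版))
Year: 2019
Volume: v.34; cumulative No.127
Issue: 01
Funding: National Natural Science Foundation of China (61772081)
Language: Chinese
Record ID: BJGY201901013
Page count: 7
Pages: 71-77
CN: 11-5866/N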

Abstract

Similarity calculation is a cornerstone of natural language processing. As natural language processing technology has developed, the research and application value of similarity calculation has become increasingly prominent. However, because of problems with their complexity and accuracy, existing methods do not meet the requirements of real-world applications. A system of similarity calculation methods that suits large-scale practical use across texts of different granularity is therefore urgently needed. From a methodological perspective, this paper first summarizes the current mainstream similarity calculation methods, then describes how similarity calculation differs across text granularities (word, sentence, and document) and reviews the research progress of recent years. Finally, it summarizes the open problems in similarity calculation and gives an outlook on future trends.
