Question Similarity Algorithm Based on Common Chunks and N-Gram Model

Original title: 基于公共词块及N-gram模型的问句相似度算法
Authors: HUANG Xianying (黄贤英); XIE Jin (谢晋); LONG Shuyan (龙姝言)
Affiliation: College of Computer Science and Engineering, Chongqing University of Technology
Keywords: question similarity; N-gram model; unigram model; common chunks
Journal code: CGGL
Journal: Journal of Chongqing University of Technology (Natural Science)
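Publication date: 2017-10-15
Publisher: Journal of Chongqing University of Technology (Natural Science) (重庆理工大学学报(自然科学))
Year: 2017
Volume/No.: v.31; No.366
Funding: Ministry of Education Humanities and Social Sciences Youth Project (16YJC860010); Chongqing Social Science Planning Doctoral Project (2015BS059)
Language: Chinese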
Record ID: CGGL201710028
Page count: 6
Issue: 10
CN: 50-1205/T
Pages: 181-185+203

Abstract

Question similarity is a core problem in question answering (QA) systems and directly affects their accuracy. Because the common chunks similarity algorithm (CCS) is not well suited to Chinese text, this paper proposes an improved question similarity algorithm (CNS). The method combines the N-gram model with common chunks to compute the similarity of question vectors: each question is decomposed into a unigram model and a bigram model, and the common chunks shared by the two questions are then analyzed with their sequential structure taken into account. Experimental results show that the new algorithm outperforms commonly used question similarity algorithms both in average similarity over the Top-N data and in accuracy under different similarity thresholds.
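
The abstract only outlines the idea, and the exact CNS formula is not given in this record. As a rough illustration of how unigram and bigram vectors might be combined with order-aware common chunks, the following Python sketch makes several assumptions that are not taken from the paper: questions are tokenized character by character, the N-gram part is a cosine over unigram and bigram counts, common chunks are extracted greedily, and the two parts are mixed with a hypothetical weight `alpha`. All function names are illustrative.

```python
from collections import Counter
from math import sqrt


def ngrams(tokens, n):
    """Return the contiguous n-grams (as tuples) of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def common_chunks(s, t):
    """Assumed greedy extraction: scanning s left to right, take the longest
    run starting at each position that also occurs contiguously in t.
    Returns a list of (chunk, start_position_in_t)."""
    chunks, i = [], 0
    while i < len(s):
        best_len, best_pos = 0, -1
        for j in range(len(t)):
            k = 0
            while i + k < len(s) and j + k < len(t) and s[i + k] == t[j + k]:
                k += 1
            if k > best_len:
                best_len, best_pos = k, j
        if best_len:
            chunks.append((tuple(s[i:i + best_len]), best_pos))
            i += best_len
        else:
            i += 1
    return chunks


def order_score(chunks):
    """Fraction of adjacent chunk pairs that keep their relative order in t."""
    if len(chunks) < 2:
        return 1.0 if chunks else 0.0
    kept = sum(1 for (_, p), (_, q) in zip(chunks, chunks[1:]) if p < q)
    return kept / (len(chunks) - 1)


def question_similarity(q1, q2, alpha=0.5):
    """Hypothetical combination of an N-gram score and a common-chunk score."""
    s, t = list(q1), list(q2)  # assumption: character-level tokens
    v1 = Counter(ngrams(s, 1) + ngrams(s, 2))
    v2 = Counter(ngrams(t, 1) + ngrams(t, 2))
    gram_sim = cosine(v1, v2)
    chunks = common_chunks(s, t)
    covered = sum(len(c) for c, _ in chunks)
    coverage = covered / max(len(s), len(t)) if s and t else 0.0
    chunk_sim = coverage * order_score(chunks)
    return alpha * gram_sim + (1 - alpha) * chunk_sim


if __name__ == "__main__":
    print(question_similarity("如何申请信用卡", "信用卡如何申请"))
```

In this sketch two questions that contain the same characters in a different order share all unigrams but not all bigrams, and their common chunks appear out of order, so the score drops below that of identical questions. Note that the greedy chunk extraction is not symmetric in its two arguments.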

References

[1] AMIRI H, RESNIK P, BOYD G J, et al. Learning Text Pair Similarity with Context-sensitive Autoencoders[C]//Meeting of the Association for Computational Linguistics. Germany: [s.n.], 2016: 1882-1892.
[2] GAIZAUSKAS R, HUMPHREYS K. A Combined IR/NLP Approach to Question Answering Against Large Text Collections[C]//Proceedings of the 6th Content-based Multimedia Information Access (RIAO-2000). France: [s.n.], 2000.
[3] VOORHEES E. The TREC-8 Question Answering Track Report[C]//Proceedings of the Eighth Text Retrieval Conference (TREC 2002). USA: [s.n.], 2002.
[4] POONAM G, VISHAL G. A Survey of Text Question Answering Techniques[J]. International Journal of Computer Applications, 2013, 53(4): 1-8.
[5] MATTHEW W B, ERIC N. Improving Text Retrieval Precision and Answer Accuracy in Question Answering Systems[C]//Proceedings of the 2nd Workshop on Information Retrieval for Question Answering (Coling 2008), Manchester. UK: [s.n.], 2008: 1-8.
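[6] XU Haizhou. Research on Question Similarity Calculation Methods in Automatic Question Answering Systems[D]. Nanchang: East China Jiaotong University, 2014. (in Chinese)
[7] DING Feifei, YANG Sichun, LIU Renjin. Keyword Extraction from Chinese Questions Based on Average Information Entropy[J]. Journal of West Anhui University, 2014(5): 46-49. (in Chinese)
[8] LI Jiyue. Research on Question Retrieval Techniques in Chinese Community Question Answering Systems[D]. Beijing: Beijing Institute of Technology, 2016. (in Chinese)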
[9] JIANG R, KIM S, BANCHS R E, et al. Towards Improving the Performance of Vector Space Model for Chinese Frequently Asked Question Answering[C]//International Conference on Asian Language Processing. China: IEEE, 2015: 136-139.
[10] HUANG Xianying, LIU Yingtao, RAO Qinfei. An English Short Text Similarity Algorithm Based on Common Chunks[J]. Journal of Chongqing University of Technology (Natural Science), 2015, 29(8): 88-93. (in Chinese)
[11] SEVERYN A, NICOSIA M, MOSCHITTI A. Learning Semantic Textual Similarity with Structural Representations[C]//51st Annual Meeting of the Association for Computational Linguistics. Bulgaria: [s.n.], 2013: 714-718.
[12] YU Jinkai, WANG Yingxue, CHEN Huaichu. An Improved Text Feature Extraction Algorithm Based on N-Gram[J]. Library and Information Service, 2004, 48(8): 48-50. (in Chinese)
