A Fuzzy Query Expansion Method Based on Word Embeddings

English title: Query Expansion Based on Word Embedding in Fuzzy Document Retrieval
Authors: CHEN Shuqiao (陈淑巧); QIU Dong (邱东); JIANG Haihuan (江海欢)
Affiliation: College of Mathematics and Physics, Chongqing University of Posts and Telecommunications (重庆邮电大学理学院)
Keywords: word embedding; fuzzy query expansion; information retrieval
Journal code: SCSD
Journal: Journal of Sichuan Normal University (Natural Science)
Publication date: 2019-01-11
Publisher: 四川师范大学学报(自然科学版) (Journal of Sichuan Normal University, Natural Science edition)
Year: 2019
Issue: v.42
Funding: National Natural Science Foundation of China (11671001 and 61472056)
Language: Chinese
Pages: SCSD201901015
Page count: 6
CN: 01
ISSN: 51-1295/N
Classification number: 96-101

Abstract

In Chinese text, the same meaning can often be expressed in many different ways, and different individuals may understand the same word somewhat differently. As a result, the terms in a query often fail to match the terms in the documents being searched, the so-called term mismatch problem. Fuzzy retrieval is one effective way to mitigate this problem, but fuzzy retrieval that relies only on known information cannot meet retrieval needs in an era of large-scale unlabeled web text. This paper proposes a query expansion method for fuzzy retrieval based on word embeddings: word vectors, trained on a large corpus with the continuous bag-of-words (CBOW) model, are used to find words similar to the query terms, and the query is then expanded with those words. Compared with the traditional fuzzy retrieval method on the same test set, the proposed method improves recall, precision, and their harmonic mean.
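The idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy vectors, the similarity threshold, the max-based fuzzy scoring, and all names (`VECTORS`, `DOCS`, `RELEVANT`, `expand_query`, `retrieve`, `precision_recall_f1`) are assumptions made for illustration. In practice the vectors would come from a CBOW model trained on a large corpus. The sketch expands a query with words whose cosine similarity to a query term exceeds a threshold, uses that similarity as a membership weight, and then computes precision, recall, and their harmonic mean.

```python
import math

# Hypothetical 3-dimensional word vectors (illustrative values only;
# real vectors would be learned by a CBOW model on a large corpus).
VECTORS = {
    "car":        (1.0, 0.0, 0.0),
    "automobile": (0.9, 0.1, 0.0),
    "vehicle":    (0.8, 0.6, 0.0),
    "banana":     (0.0, 0.0, 1.0),
}

# Hypothetical tokenized documents and relevance judgments for the query "car".
DOCS = {
    "d1": ["car", "engine"],
    "d2": ["automobile", "repair"],
    "d3": ["banana", "bread"],
    "d4": ["vehicle", "tax"],
}
RELEVANT = {"d1", "d2", "d4"}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def expand_query(terms, vectors, threshold=0.7):
    """Map each query term to weight 1.0 and each sufficiently similar
    word to its highest cosine similarity with any query term."""
    weights = {}
    for term in terms:
        if term not in vectors:
            continue  # out-of-vocabulary terms cannot be expanded
        for word, vec in vectors.items():
            if word in terms:
                continue
            sim = cosine(vectors[term], vec)
            if sim >= threshold:
                weights[word] = max(weights.get(word, 0.0), sim)
    for term in terms:
        weights[term] = 1.0
    return weights


def retrieve(docs, weights, cutoff=0.5):
    """Return ids of documents whose fuzzy score (max weight of any
    matching term, an assumed fuzzy-OR) reaches the cutoff."""
    result = set()
    for doc_id, tokens in docs.items():
        score = max((weights[t] for t in tokens if t in weights), default=0.0)
        if score >= cutoff:
            result.add(doc_id)
    return result


def precision_recall_f1(retrieved, relevant):
    """Precision, recall, and their harmonic mean (F1)."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    baseline = retrieve(DOCS, {"car": 1.0})
    expanded = retrieve(DOCS, expand_query(["car"], VECTORS))
    print("baseline:", precision_recall_f1(baseline, RELEVANT))
    print("expanded:", precision_recall_f1(expanded, RELEVANT))
```

On this toy data the unexpanded query matches only the document that literally contains "car", while the expanded query also reaches the documents that use "automobile" or "vehicle", which is the term mismatch effect the abstract describes. The threshold controls the trade-off: a lower value admits more expansion words and tends to raise recall at the cost of precision.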

References

[1] WANG Zhijin, ZHENG Hongjun. Information retrieval models based on set theory[J]. Information Science, 2004, 22(11): 1288-1291. (in Chinese)
[2] LIU Shulin. Research and implementation of information retrieval based on domain ontology[D]. Changchun: Northeast Normal University, 2009. (in Chinese)
[3] OGAWA Y, MORITA T, KOBAYASHI K. A fuzzy document retrieval system using the keyword connection matrix and a learning method[J]. Fuzzy Sets and Systems, 1991, 39(2): 163-179.
[4] MANDALA R, TOKUNAGA T, TANAKA H. Query expansion using heterogeneous thesauri[J]. Information Processing and Management, 2000, 36(3): 361-378.
[5] MA Huinan, WU Jiangning, PAN Donghua. A fuzzy query expansion method based on a synonym dictionary[J]. Journal of Dalian University of Technology, 2007, 47(3): 439-443. (in Chinese)
[6] LIU Z, CHEN J, LI X, et al. Design and application for the model of semantic query expansion based on domain ontology[J]. International Journal of Modelling, Identification and Control, 2012, 16(3): 277-284.
[7] BENGIO Y, DUCHARME R, VINCENT P, et al. A neural probabilistic language model[J]. Journal of Machine Learning Research, 2003, 3(2): 1137-1155.
[8] MNIH A, HINTON G. Three new graphical models for statistical language modelling[C]//Proceedings of the 24th International Conference on Machine Learning. Corvallis, Oregon: ACM, 2007: 641-648.
[9] MIKOLOV T, CHEN K, CORRADO G, et al. Efficient estimation of word representations in vector space[C]//Proceedings of Workshop at International Conference on Learning Representations. Scottsdale, Arizona: ICLR, 2013: 1301-1378.
[10] YE Guanghui. Research on improved fuzzy retrieval based on a word-word association matrix[D]. Wuhan: Central China Normal University, 2013. (in Chinese)
[11] MIKOLOV T, SUTSKEVER I, CHEN K, et al. Distributed representations of words and phrases and their compositionality[J]. Advances in Neural Information Processing Systems, 2013, 26(3): 3111-3119.
[12] COLLOBERT R, WESTON J. A unified architecture for natural language processing: deep neural networks with multitask learning[C]//Proceedings of the 25th International Conference on Machine Learning. Helsinki, Finland: ACM, 2008: 160-167.
[13] MNIH A, KAVUKCUOGLU K. Learning word embeddings efficiently with noise-contrastive estimation[J]. Advances in Neural Information Processing Systems, 2013, 28(2): 2265-2273.
[14] LIU Xin, XI Yaoyi, WANG Bo, et al. A sentence retrieval method combining WordNet and word embeddings[J]. Journal of Information Engineering University, 2017, 12(4): 486-491. (in Chinese)
[15] ZOU Yimin, ZHANG Zhixiong. A review of methods for evaluating the intelligence value of online scientific and technical information[J]. Journal of Intelligence, 2014, 33(5): 25-30. (in Chinese)
[16] HUANG Jingyu. Exact geodesics on three-dimensional models and some applications[D]. Guilin: Guangxi Normal University, 2013. (in Chinese)
