Short text retrieval combining Wikipedia taxonomy and explicit semantic features

Original title (Chinese): 融合Wikipedia分类结构及显式语义特征的短文本检索
Authors: LI Pu (李璞); ZHANG Zhifeng (张志锋); YANG Baibing (杨百冰); XIAO Bao (肖宝); JIANG Yuncheng (蒋运承)
Keywords: Wikipedia taxonomy; explicit semantic feature; feature selection; short text; information retrieval
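Journal title (Chinese): NNXB
Journal title (English): Journal of Henan Agricultural University
Institutions: Software Engineering College, Zhengzhou University of Light Industry; School of Electronics and Information Engineering, Beibu Gulf University; School of Computer Science, South China Normal University
Publication date: 2019-04-15
Publisher: Journal of Henan Agricultural University
Year: 2019
Issue: v.53; No.212
Funding: National Natural Science Foundation of China, Young Scientists Fund (61802352); National Natural Science Foundation of China, General Program (61772210); Doctoral Research Fund of Zhengzhou University of Light Industry (0215/13501050015); Project for Improving the Basic Research Ability of Young and Middle-aged Teachers in Guangxi Universities (2019KY046); Qinzhou Scientific Research and Technology Development Program (20189903); Guangzhou Science and Technology Program (2014J4100031)
Language: Chinese
Page: NNXB201902016
Page count: 9
CN: 02
ISSN: 41-1112/S
Classification number: 100-108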

Abstract

Short texts are now abundant on the Web, but they are brief, carry little information, have sparse features, and often use irregular grammar, so traditional information retrieval techniques cannot process them effectively. To address this problem, this study takes semantic relatedness as its starting point and investigates short text retrieval based on Wikipedia, currently a mainstream source of semantic knowledge. Using the taxonomy information contained in Wikipedia pages, an explicit semantic feature selection and relatedness computation method is proposed. On this basis, a short text retrieval method in a low-dimensional explicit semantic space is proposed, and its feasibility and effectiveness are verified experimentally. The results show that, compared with current graph-based and link-based methods, the proposed method improves MAP by 6% and 4.1%, P@30 by 10.4% and 5.8%, and R-Prec by 6.1% and 3%, respectively.
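
The abstract reports results on three standard ranking metrics: MAP, P@30, and R-Prec. As a reading aid, the following is a minimal sketch of how these metrics are conventionally computed from a ranked result list and a set of relevant documents. It is not the authors' evaluation code; the function names and the toy document IDs are assumptions made for illustration only.

```python
from typing import Dict, List, Set, Tuple


def precision_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """P@k: fraction of the top-k ranked documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / k


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """AP: mean of the precision values at each rank where a relevant
    document appears, normalized by the total number of relevant documents."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def r_precision(ranked: List[str], relevant: Set[str]) -> float:
    """R-Prec: precision at rank R, where R is the number of relevant documents."""
    if not relevant:
        return 0.0
    return precision_at_k(ranked, relevant, len(relevant))


def mean_average_precision(runs: Dict[str, Tuple[List[str], Set[str]]]) -> float:
    """MAP: average of AP over all queries.
    `runs` maps a query ID to (ranked list, relevant set)."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs.values()) / len(runs)


if __name__ == "__main__":
    # Hypothetical toy data, not taken from the paper.
    runs = {
        "q1": (["d1", "d2", "d3", "d4", "d5"], {"d1", "d3", "d6"}),
        "q2": (["a", "b"], {"b"}),
    }
    print("MAP   :", mean_average_precision(runs))
    print("P@3   :", precision_at_k(*runs["q1"], k=3))
    print("R-Prec:", r_precision(*runs["q1"]))
```

In this sketch P@k divides by k even when fewer than k documents are returned, which is the usual convention; the paper's P@30 corresponds to `k=30`.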

