Abstract
Short texts, which now appear in large volumes across the Web, are brief, carry little information, have sparse features, and often use irregular grammar, so traditional information retrieval techniques cannot process them effectively. To address this problem, this study takes semantic relatedness as its starting point and investigates short text retrieval based on Wikipedia, a current mainstream source of semantic knowledge. Using the category structure information contained in Wikipedia pages, it proposes a method for explicit semantic feature selection and relatedness computation. Building on this, it proposes a short text retrieval method that operates in a low-dimensional explicit semantic space, and experiments verify the feasibility and effectiveness of the method. Compared with existing graph-based and link-based methods, the proposed method improves MAP by 6% and 4.1%, P@30 by 10.4% and 5.8%, and R-Prec by 6.1% and 3%, respectively.
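The three evaluation measures named above (MAP, P@30, and R-Prec) are standard ranked-retrieval metrics. The following is a minimal sketch of their textbook definitions, assuming binary relevance judgments; the function names and the toy document IDs are illustrative assumptions and are not taken from the authors' evaluation code.

```python
from typing import Dict, List, Set, Tuple


def precision_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """P@k: fraction of the top-k ranked documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / k


def r_precision(ranked: List[str], relevant: Set[str]) -> float:
    """R-Prec: precision at rank R, where R is the number of relevant documents."""
    if not relevant:
        return 0.0
    return precision_at_k(ranked, relevant, len(relevant))


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """AP: mean of the precision values at each rank where a relevant document occurs.

    Relevant documents that are never retrieved contribute zero.
    """
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(runs: Dict[str, Tuple[List[str], Set[str]]]) -> float:
    """MAP: average of AP over all queries; `runs` maps query id -> (ranking, relevant set)."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs.values()) / len(runs)


# Toy data (hypothetical document IDs, for illustration only).
toy_runs = {
    "q1": (["d1", "d2", "d3", "d4", "d5"], {"d1", "d3", "d6"}),
    "q2": (["a", "b"], {"b"}),
}
q1_ranked, q1_relevant = toy_runs["q1"]
p_at_2 = precision_at_k(q1_ranked, q1_relevant, 2)
r_prec = r_precision(q1_ranked, q1_relevant)
map_score = mean_average_precision(toy_runs)
```

P@30 in the abstract corresponds to `precision_at_k(ranked, relevant, 30)`. The sketch says nothing about whether the reported gains are relative or absolute differences.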