A Text Similarity Calculation Method Based on the SA_LDA Model

English title: Text Similarity Calculation Algorithm Based on SA_LDA Model
Authors: QIU Xian-biao (邱先标); CHEN Xiao-rong (陈笑蓉)
Keywords: text similarity; SA_LDA model; topic model; text mining
Journal code: JSJA
Journal: Computer Science (计算机科学)
Affiliation: College of Computer Science and Technology, Guizhou University (贵州大学计算机科学与技术学院)
Publication date: 2018-06-15
Publisher: Computer Science (计算机科学)
Year: 2018
Volume: 45
Issue: S1
Funding: Supported by the National Natural Science Foundation of China (61363028)
Language: Chinese
File name: JSJA2018S1022
Pages: 119-122+152
Page count: 5
CN: 50-1075/TP

Abstract

Computing text similarity underlies many text information processing techniques. However, the commonly used similarity methods based on the vector space model (VSM) suffer from high-dimensional, sparse representations and poor semantic sensitivity, so their results are unsatisfactory. Building on the traditional LDA (Latent Dirichlet Allocation) model, and addressing its need for a manually specified number of topics, this paper proposes a self-adaptive LDA (SA_LDA) model that determines the number of topics through the model's own iteration. The model is then applied to text similarity calculation, which alleviates the high-dimensional sparsity problem to some extent. Experiments show that the method determines the number of topics automatically and that it achieves higher accuracy than VSM in similarity calculation, along with better text clustering results.
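
The contrast the abstract draws between VSM and topic-based similarity can be illustrated with a minimal sketch. It is not the authors' SA_LDA procedure: the abstract does not state which similarity measure is used, so cosine similarity is an assumption here, and the topic distributions `theta_a` and `theta_b` are hypothetical values standing in for the output of a trained topic model.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def vsm_vector(tokens):
    """Term-frequency vector: one dimension per distinct word (VSM)."""
    return dict(Counter(tokens))


# Two toy documents about the same subject that share no surface words.
doc_a = ["car", "engine", "road"]
doc_b = ["automobile", "motor", "highway"]

# VSM: the vectors have no dimension in common, so the similarity is zero.
vsm_sim = cosine(vsm_vector(doc_a), vsm_vector(doc_b))

# Topic space: hypothetical document-topic distributions over K = 3 topics
# (assumed values, not produced by SA_LDA or any trained model).
theta_a = {0: 0.85, 1: 0.10, 2: 0.05}
theta_b = {0: 0.80, 1: 0.05, 2: 0.15}
topic_sim = cosine(theta_a, theta_b)

print(f"VSM similarity:   {vsm_sim:.3f}")
print(f"Topic similarity: {topic_sim:.3f}")
```

In the VSM representation the dimensionality grows with the vocabulary and synonymous documents can score zero, whereas in the topic representation each document is a short dense vector of length K. The number of topics K is the quantity the SA_LDA model is described as determining automatically; here it is simply fixed at 3 for illustration.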

