A Topic Modeling Method for Social Science Literature Based on LDA
  • English title: A Topic Modeling Method for Social Science Literature Based on LDA
  • Authors: LI Chang-ya (李昌亚); LIU Fang-fang (刘方方)
  • Affiliation: School of Computer Engineering and Science, Shanghai University
  • Keywords: topic model; LDA; social science literature; Gibbs sampling
  • Journal code: WJFZ
  • Journal: Computer Technology and Development (计算机技术与发展)
  • Publication date: 2017-11-15 11:28
  • Publisher: Computer Technology and Development
  • Year: 2018
  • Volume/number: v.28; No.250
  • Funding: Shanghai Municipal Science and Technology Commission Natural Science Fund (12zr1411000)
  • Language: Chinese
  • Record ID: WJFZ201802039
  • Page count: 6
  • Issue: 02
  • CN: 61-1450/TP
  • Pages: 188-193
Abstract
With the growth of the Internet, text classification and topic extraction are applied ever more widely, and topic models play a major role in extracting topics from text. LDA (latent Dirichlet allocation) is a widely used and mature topic model. It is a probabilistic generative model that handles both synonymy (several words with one meaning) and polysemy (one word with several meanings) well. However, when LDA is used to model a document collection drawn from the social science literature, it ignores the topical characteristics of the collection itself, and the extracted topic distributions are biased toward the high-frequency words in the documents. As a result, the extracted topics drift away from the documents' actual topics and the results are not accurate enough. This paper examines the process by which LDA models the topics of documents, takes the characteristics of social science documents into account, and modifies that process accordingly. It proposes a new topic modeling method so that the extracted topics are more accurate and better match the topical characteristics of the document collection.
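For readers unfamiliar with the baseline that the abstract refers to, the following is a minimal sketch of standard LDA fitted by collapsed Gibbs sampling (the keywords name Gibbs sampling as the inference method). It is not the modified method proposed in the paper, whose details this record does not give. The function names `lda_gibbs` and `estimate_distributions`, the hyperparameter values, and the toy corpus are all assumptions made for illustration.

```python
import random


def lda_gibbs(docs, num_topics, vocab_size, alpha=0.1, beta=0.01,
              iters=200, seed=0):
    """Collapsed Gibbs sampling for standard LDA (illustrative sketch only).

    docs is a list of documents, each a list of integer word ids in
    [0, vocab_size). Returns the count tables (doc_topic, topic_word).
    """
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                # n_{d,k}
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # n_{k,w}
    topic_total = [0] * num_topics                              # n_k

    # Random initial topic assignment for every token.
    assign = []
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)

    topics = range(num_topics)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Full conditional: p(z = t | rest) is proportional to
                # (n_{d,t} + alpha) * (n_{t,w} + beta) / (n_t + V * beta).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in topics
                ]
                k = rng.choices(topics, weights=weights)[0]
                # Record the newly sampled assignment.
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return doc_topic, topic_word


def estimate_distributions(doc_topic, topic_word, alpha=0.1, beta=0.01):
    """Smoothed point estimates of theta (doc-topic) and phi (topic-word)."""
    num_topics = len(topic_word)
    vocab_size = len(topic_word[0])
    theta = [[(n + alpha) / (sum(row) + num_topics * alpha) for n in row]
             for row in doc_topic]
    phi = [[(n + beta) / (sum(row) + vocab_size * beta) for n in row]
           for row in topic_word]
    return theta, phi


if __name__ == "__main__":
    # Hypothetical toy corpus: word ids 0-2 and 3-5 mostly appear apart.
    toy_docs = [[0, 1, 2, 0, 1], [3, 4, 5, 3, 4], [0, 2, 1, 2], [5, 3, 4, 5]]
    dt, tw = lda_gibbs(toy_docs, num_topics=2, vocab_size=6, iters=100)
    theta, phi = estimate_distributions(dt, tw)
    for row in theta:
        print([round(p, 3) for p in row])
```

Because each topic's word distribution is estimated directly from token counts, words that occur very often contribute most of the mass in `phi`. This is one way to see the bias toward high-frequency words that the abstract describes for the unmodified model.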
