Biterm Topic Model Based on Semantic Extension of Double Words (基于双词语义扩展的Biterm主题模型)
  • English title: Biterm Topic Model Based on Semantic Extension of Double Words
  • Authors: 李思宇; 谢珺; 邹雪君; 续欣莹; 冀小平
  • Authors (romanized): LI Siyu; XIE Jun; ZOU Xuejun; XU Xinying; JI Xiaoping
  • Keywords (Chinese): Biterm主题模型; 双词; 词向量; 双词语义; 吉布斯采样
  • Keywords: Biterm Topic Model (BTM); double words; word vector; double words semantic; Gibbs sampling
  • Journal code: JSJC
  • Journal: Computer Engineering
  • Affiliation: College of Information Engineering, Taiyuan University of Technology
  • Publication date: 2019-01-15
  • Publisher: 计算机工程 (Computer Engineering)
  • Year: 2019
  • Volume/issue: v.45; No.496
  • Funding: Research Project for Returned Overseas Scholars of Shanxi Province (2015-045)
  • Language: Chinese
  • Article ID: JSJC201901035
  • Page count: 7
  • Issue: 01
  • CN: 31-1289/TP
  • Pages: 216-222
Abstract
In the Biterm Topic Model (BTM), the double words (biterms) generated from short-text documents lack a semantic connection between the two words of each pair. To address this, a BTM that incorporates semantic extension of double words is proposed. To account for the semantic relationship between the words of a pair, a word vector model is introduced. The word vector model is trained and used to measure the semantic distance between words, and the double words of the BTM are then semantically extended according to that distance. Experimental results show that, compared with the existing BTM, the proposed model not only classifies short-text topics better, but also markedly improves the semantic association between double words and the semantic clustering of topic words.
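The abstract does not give the exact extension rule, so the following is only a minimal sketch under stated assumptions. It shows how double words might be generated from one short document and then extended using word-vector similarity. The toy vectors, the cosine-similarity measure, the `threshold` value, and the rule "pair each document word with out-of-document vocabulary words that are semantically close" are all illustrative assumptions, not the authors' published method. In practice the vectors would come from a trained word vector model.

```python
from itertools import combinations
from math import sqrt

# Toy word vectors (assumption): stand-ins for vectors from a trained model.
TOY_VECTORS = {
    "apple": [0.9, 0.1, 0.0],
    "fruit": [0.8, 0.2, 0.1],
    "phone": [0.1, 0.9, 0.2],
    "screen": [0.0, 0.8, 0.3],
}


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def extract_biterms(tokens):
    """Unordered pairs of distinct words co-occurring in one short document."""
    return [tuple(sorted(p)) for p in combinations(tokens, 2) if p[0] != p[1]]


def extend_biterms(tokens, vectors, threshold=0.9):
    """Hypothetical extension rule (an assumption, not the paper's rule):
    keep the original biterms, then add a biterm (word, neighbour) for every
    vocabulary word outside the document whose similarity to a document
    word is at least `threshold`."""
    biterms = extract_biterms(tokens)
    seen = set(biterms)
    in_doc = set(tokens)
    for word in dict.fromkeys(tokens):      # de-duplicate, keep order
        if word not in vectors:
            continue
        for candidate in sorted(vectors):   # sorted for deterministic output
            if candidate in in_doc:
                continue
            if cosine_similarity(vectors[word], vectors[candidate]) >= threshold:
                pair = tuple(sorted((word, candidate)))
                if pair not in seen:
                    seen.add(pair)
                    biterms.append(pair)
    return biterms


if __name__ == "__main__":
    print(extend_biterms(["apple", "phone"], TOY_VECTORS))
```

With the toy vectors, "apple" and "phone" are far apart while "apple"/"fruit" and "phone"/"screen" are close, so the sketch adds two semantically related pairs to the single original biterm. The extended biterm list would then be fed to Gibbs sampling in place of the plain list.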
