Abstract
In the Biterm Topic Model (BTM), the two words of a biterm generated from a short text document often lack any semantic connection. To address this, a BTM that incorporates semantic extension of word pairs is proposed. To account for the semantic relationship between the two words of a biterm, a word vector model is introduced. The word vector model is trained and used to measure the semantic distance between words, and the biterms of the BTM are then semantically extended according to that distance. Experimental results show that, compared with the existing BTM, the proposed model achieves better short text topic classification, and that both the semantic association between the words of a biterm and the semantic clustering of topic words improve markedly.
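The abstract does not spell out the extension rule, so the following is only a minimal sketch of one possible reading: biterms are the unordered word pairs of a short document, and each document word is additionally paired with vocabulary words whose vectors are close to it. The toy vectors, the function names (`cosine`, `biterms`, `extend_biterms`), and the similarity threshold are all illustrative assumptions, not the authors' implementation; in practice the vectors would come from a trained word vector model.

```python
from itertools import combinations
import math

# Toy 2-D word vectors. ASSUMPTION: illustrative values only,
# standing in for vectors from a trained word vector model.
VECTORS = {
    "apple": (0.9, 0.1),
    "fruit": (0.8, 0.2),
    "phone": (0.1, 0.9),
    "mobile": (0.2, 0.8),
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def biterms(doc):
    """Unordered word pairs co-occurring in one short document."""
    return [tuple(sorted(pair)) for pair in combinations(doc, 2)]


def extend_biterms(doc, vectors, threshold=0.95):
    """Return (original biterms, extra biterms).

    ASSUMPTION: the extension rule here is hypothetical. Each document
    word is paired with every out-of-document vocabulary word whose
    cosine similarity to it reaches the threshold.
    """
    base = biterms(doc)
    extra = []
    for word in doc:
        if word not in vectors:
            continue  # no vector available, so no semantic distance
        for cand, vec in vectors.items():
            if cand in doc:
                continue  # only add words not already in the document
            if cosine(vectors[word], vec) >= threshold:
                extra.append(tuple(sorted((word, cand))))
    return base, extra


base, extra = extend_biterms(["apple", "phone"], VECTORS)
print(base)
print(extra)
```

Under these assumptions, the extra pairs would be fed to the topic model alongside the original biterms; the threshold controls how aggressively the sparse biterm set of a short text is enlarged.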