Abstract
Mining the probability distribution of topic words in a document collection gives a summary view of the documents' content. Further exploring the connections between words under a given topic not only enriches the meaning of the topic words but also reveals the hierarchical and clustering structure of the topic in finer detail. To this end, for labeled document collections, this paper proposes a method for computing the probability that two words co-occur under a given topic, based on the Gibbs sampling results of Labeled Latent Dirichlet Allocation (LDA) analysis, and constructs a topical text network on this basis. Compared with the Pointwise Labeled-LDA (PL-LDA) model, the method does not expand the original documents, requires less computation, and runs in less time. Experimental results on an aviation safety report data set show that, for topics with many labeled words, the method displays the distribution of topic words and the complex relationships among them well.
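The abstract does not state the formula used, so the following Python sketch is only an illustration of the general idea, not the paper's estimator. It assumes the final Gibbs sample is available as (document id, word, topic id) triples, one per token, and it estimates the co-occurrence probability of two words under a topic as the fraction of documents using that topic in which both words are assigned to it. The function names `topic_cooccurrence` and `build_edges`, the input layout, and the threshold parameter are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def topic_cooccurrence(assignments, topic):
    """Estimate P(w1, w2 | topic) from per-token topic assignments.

    assignments: iterable of (doc_id, word, topic_id) triples (assumed layout).
    Returns a dict mapping alphabetically sorted word pairs to the fraction of
    documents, among those with at least one token assigned to `topic`,
    in which both words are assigned to `topic`.
    """
    # Collect, per document, the set of words assigned to the topic.
    words_by_doc = defaultdict(set)
    for doc_id, word, topic_id in assignments:
        if topic_id == topic:
            words_by_doc[doc_id].add(word)

    n_docs = len(words_by_doc)
    if n_docs == 0:
        return {}

    # Count the documents in which each unordered word pair appears together.
    pair_counts = defaultdict(int)
    for words in words_by_doc.values():
        for pair in combinations(sorted(words), 2):
            pair_counts[pair] += 1

    return {pair: count / n_docs for pair, count in pair_counts.items()}


def build_edges(probabilities, min_prob):
    """Keep word pairs whose estimated probability reaches min_prob.

    Each returned (w1, w2, p) triple can serve as a weighted network edge.
    """
    return [(w1, w2, p) for (w1, w2), p in probabilities.items() if p >= min_prob]


if __name__ == "__main__":
    # Toy assignments, invented purely for illustration.
    sample = [
        (0, "engine", "t1"), (0, "fire", "t1"), (0, "runway", "t2"),
        (1, "engine", "t1"), (1, "smoke", "t1"),
        (2, "engine", "t1"), (2, "fire", "t1"),
    ]
    probs = topic_cooccurrence(sample, "t1")
    print(build_edges(probs, min_prob=0.5))
```

Because this works directly on the sampled assignments, it needs no expansion of the original documents; the choice of threshold controls how dense the resulting network is.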