Abstract
[Purpose/significance] When LDA is applied to topic analysis in specialized subdomains, the resulting topics are often hard to read and hard to interpret. In intelligence analysis practice, using domain terms as the basis for topic analysis has gradually become a trend. It is therefore necessary to compare it quantitatively against the topics produced by traditional word selection schemes, in order to verify its effectiveness and to support subsequent theoretical research and practical application in intelligence studies. [Method/process] First, on the basis of a literature review, three word selection schemes are chosen: "noun + verb", "noun", and "domain term". LDA comparison experiments are constructed over multiple parameter settings (number of topics and number of words), and qualitative and quantitative evaluation methods, based on domain expert analysis and topic coherence calculation respectively, are proposed to compare the interpretability and coherence of the topics obtained under each scheme. Then, taking the cardiovascular field as an example, specific experimental parameters are set, 600 rounds of LDA experiments are carried out, and the results are analyzed. [Result/conclusion] The experimental results show that LDA topics obtained with domain terms as the word selection scheme are more interpretable and more readable. Where intelligence research involves topic analysis of a specialized subdomain, domain terms should be used as the unit of analysis wherever possible.
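The quantitative side of the evaluation described above rests on a topic coherence score. The abstract does not say which coherence measure was used, so the following is only a minimal sketch that assumes a UMass-style measure based on document co-occurrence counts. The function name `umass_coherence`, the smoothing parameter `eps`, and the toy cardiovascular tokens are all hypothetical and are not taken from the study.

```python
import math


def umass_coherence(top_words, documents, eps=1.0):
    """UMass-style coherence for one topic (an assumed measure, see lead-in).

    top_words: the topic's top words, most probable first.
    documents: token lists produced by one word selection scheme.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain every word in `words`.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    # Pair each word with every higher-ranked word in the same topic.
    for i in range(1, len(top_words)):
        for j in range(i):
            d_j = doc_freq(top_words[j])
            if d_j == 0:
                continue  # skip words that never occur in the corpus
            co = doc_freq(top_words[i], top_words[j])
            score += math.log((co + eps) / d_j)
    return score


# Hypothetical toy corpus, tokenized as domain terms.
docs = [
    ["hypertension", "blood_pressure", "stroke"],
    ["hypertension", "blood_pressure"],
    ["arrhythmia", "palpitation"],
    ["arrhythmia", "palpitation", "stroke"],
]
coherent = ["hypertension", "blood_pressure"]  # words that co-occur
mixed = ["hypertension", "palpitation"]        # words that never co-occur

print(umass_coherence(coherent, docs) > umass_coherence(mixed, docs))
```

Under this assumption, comparing word selection schemes would amount to tokenizing the same corpus once per scheme, training LDA for each combination of topic count and word count, and averaging a score like this one over the topics of each model. A higher (less negative) average suggests more coherent topics.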