A Comparative Study of Word Selection Schemes in LDA Topic Analysis of Subdivided Domains

English title (as published): Comparative Study on the Effect of Word Selection Scheme on the Effect of LDA Analysis in Subdivisions
Authors: Chen Guo (陈果); Wu Wei (吴微)
Keywords: LDA; domain terminology; topic analysis; domain knowledge analysis; cardiovascular medicine
Journal name (Chinese, abbreviated): QBLL
Journal name (English): Information Studies: Theory & Application
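Affiliations: School of Economics and Management, Nanjing University of Science and Technology; Jiangsu Collaborative Innovation Center of Social Safety Science and Technology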
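Publication date: 2019-01-31 15:32
Publisher: Information Studies: Theory & Application (情报理论与实践)
Year: 2019
Volume and serial number: v.42; No.305
Funding: One of the outcomes of the National Social Science Fund of China Youth Project "Research on Semantic Mining and Knowledge Evolution of Scientific and Technical Vocabulary from a Domain Analysis Perspective", project No. 16CTQ024
Language: Chinese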
Record ID: QBLL201906024
Page count: 6
Issue: 06
CN: 11-1762/G3
Pages: 142-147

Abstract

[Purpose/Significance] When LDA is applied to topic analysis of a subdivided domain, the resulting topics are often hard to read and hard to interpret. Using domain terms as the unit of topic analysis is becoming common practice in information analysis, so its results should be compared quantitatively against those of traditional word selection schemes, both to verify its effectiveness and to support later theoretical research and practical application in information science. [Method/Process] Based on a literature review, three word selection schemes were chosen: "noun + verb", "noun", and "domain term". Comparative LDA experiments were constructed over multiple parameter settings (number of topics and number of words), and qualitative and quantitative evaluation methods, based on domain expert analysis and topic coherence calculation respectively, were proposed to compare the interpretability and coherence of the topics produced by each scheme. Taking the cardiovascular field as an example, specific experimental parameters were then set, a total of 600 rounds of LDA experiments were run, and the results were analyzed. [Results/Conclusions] The experiments show that LDA topics obtained with domain terms as the word selection scheme are more interpretable and more readable. Information research that involves topic analysis of subdivided domains should, where possible, use domain terms as the objects of analysis.
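
The two mechanical steps described in the abstract, filtering a corpus by word selection scheme and scoring a topic's coherence, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say which coherence measure was used, so the sketch assumes the UMass measure of Mimno et al. (2011); the part-of-speech tags `"n"` and `"v"`, the function names `select_words` and `umass_coherence`, and the sample tokens are all hypothetical.

```python
import math


def select_words(tagged_doc, scheme, domain_terms=frozenset()):
    """Filter one POS-tagged document according to a word selection scheme.

    tagged_doc   : list of (word, pos) pairs; the tags "n" (noun) and
                   "v" (verb) are an assumed convention, not the paper's.
    scheme       : "noun+verb", "noun", or "domain_term".
    domain_terms : set of domain terms, used only by "domain_term".
    """
    if scheme == "noun+verb":
        return [w for w, pos in tagged_doc if pos in ("n", "v")]
    if scheme == "noun":
        return [w for w, pos in tagged_doc if pos == "n"]
    if scheme == "domain_term":
        return [w for w, _ in tagged_doc if w in domain_terms]
    raise ValueError(f"unknown scheme: {scheme}")


def umass_coherence(top_words, docs):
    """UMass coherence of one topic (assumed measure, see lead-in).

    top_words : the topic's top words, most probable first.
    docs      : list of tokenized documents (lists of words).
    Sums log((D(w_m, w_l) + 1) / D(w_l)) over all pairs with l < m,
    where D counts the documents containing the given word(s).
    """
    doc_sets = [set(d) for d in docs]

    def doc_freq(*words):
        return sum(1 for s in doc_sets if all(w in s for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            d_l = doc_freq(top_words[l])
            if d_l == 0:
                # A word absent from the corpus gives no usable evidence.
                continue
            d_ml = doc_freq(top_words[m], top_words[l])
            score += math.log((d_ml + 1) / d_l)
    return score


# Hypothetical tagged document and domain term list, for illustration only.
tagged = [("hypertension", "n"), ("treat", "v"), ("the", "u"), ("patient", "n")]
terms = {"hypertension"}
for scheme in ("noun+verb", "noun", "domain_term"):
    print(scheme, select_words(tagged, scheme, terms))

# Hypothetical toy corpus: scores closer to zero indicate words that
# co-occur more consistently.
corpus = [["a", "b"], ["a", "b"], ["a", "c"]]
print(umass_coherence(["a", "b"], corpus))
print(umass_coherence(["a", "c"], corpus))
```

In a comparison like the one the abstract describes, each scheme's filtered corpus would be fed to an LDA implementation for every parameter setting, and a coherence score of this kind would be averaged over the resulting topics so that the schemes can be compared on the same scale.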

