Research on Topic Identification Methods for Scientific and Technical Texts Based on Multi-Relation Fusion
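  • English title: Topic Identification Based on Multi-Semantic Relation Fusion
  • Authors: 许海云; 武华维; 罗瑞; 董坤; 李婧
  • Authors (English): XU Haiyun; WU Huawei; LUO Rui; DONG Kun; LI Jing
  • Keywords: text topic identification; multiple relations; data fusion; relation fusion; topic clustering
  • English keywords (as published): Topic recognition based on text; Multiple relations; Data fusion; Relational fusion; Topic clustering
  • Journal (Chinese abbreviation): ZGTS
  • Journal (English): Journal of Library Science in China
  • Affiliations: Chengdu Library and Information Center, Chinese Academy of Sciences, and Institute of Scientific and Technical Information of China; Chengdu Library and Information Center, Chinese Academy of Sciences, and University of Chinese Academy of Sciences; Institute of Scientific and Technical Information, Shandong University of Technology; Chengdu Library and Information Center, Chinese Academy of Sciences
  • Publication date: 2019-01-15 15:28
  • Publisher: Journal of Library Science in China (中国图书馆学报)
  • Year: 2019
  • Issue: v.45; No.239
  • Funding: One of the research outputs of the National Natural Science Foundation of China project "Research on Methods for Identifying Innovation Evolution Paths Based on Science-Technology Topic Association Analysis" (No. 71704170) and the Chinese Academy of Sciences Intellectual Property Information Service Special Project "Research Informatization Application for Knowledge Discovery in the Stem Cell Field" (No. KFJ-EWSTS-032)
  • Language: Chinese
  • Page: ZGTS201901006
  • Page count: 13
  • CN: 01
  • ISSN: 11-2746/G2
  • Classification number: 84-96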
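Abstract
Most current methods for acquiring text topics rely on a single type of association analysis. They cannot fully exploit the available information and therefore struggle to identify scientific and technological development topics accurately. The topic terms, authors, and citations of scientific and technical literature carry semantic associations linked by research topic content, and topic term co-occurrence, citation relations, and co-authorship each reveal topic associations from a different angle. Accordingly, based on the semantic distance between topic terms, this paper divides topic term associations in topic identification into basic relations, strengthened relations, and additional relations, and on this basis proposes a multi-relation extraction and relation fusion method for topic identification. An empirical analysis is carried out in the field of research, development, and preparation of genetically engineered vaccines, using the PathSelClus algorithm to perform topic clustering based on multi-relation fusion. Comparative experiments show that multi-relation fusion can effectively improve text topic clustering in the empirical field, while multi-relation fusion for topic identification remains an issue that needs focused attention in future work. 4 figures. 6 tables. 19 references.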
Multivariate relation processing is a typical characteristic of big data analysis. Multi-relation analysis of topics examines the relationships between topics and other measurable entities (MEs). Scientific and technical documents contain many MEs, which relate to knowledge units directly or indirectly. However, current topic acquisition methods for such documents mostly rely on a single type of association analysis, which makes it difficult to identify topics of scientific or technological development accurately. Finding the multiple relations among the entities of a document is therefore a key technique for accurate topic identification in large bodies of scientific and technical literature. This paper first reviews the state of research on multi-relation fusion in topic identification and summarizes the measurable relationships of topic terms in scientific and technical literature. It finds that topic terms, authors, and citations are semantically related through topic content, and that their co-occurrence relations each reveal topic associations from a different perspective. Based on the semantic distance between topic terms, we divide topic term associations in topic identification into basic relations, strengthened relations, and additional relations. For strengthened and additional relations, any type of ME can serve as the intermediate node of a topic term association, so choosing appropriate intermediate MEs is especially important for fully establishing the semantic associations between topic terms. This paper chooses authors, references, and citing literature as the intermediate MEs for strengthened and additional relations. Together, the topic terms and these MEs form seven types of topic associations.
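As a rough illustration of how an intermediate entity can link topic terms, the following minimal sketch contrasts direct term co-occurrence with an indirect association through shared authors. The toy records, the function names `cooccurrence` and `via_entity`, and the simple counting scheme are all assumptions made for illustration; they are not the paper's relation matrix algorithm or its weighting.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy records: each document lists its topic terms and authors.
docs = [
    {"terms": {"vaccine", "antigen"}, "authors": {"A"}},
    {"terms": {"vector", "antigen"}, "authors": {"A", "B"}},
    {"terms": {"adjuvant"}, "authors": {"B"}},
]

def cooccurrence(docs):
    """Direct relation: count documents in which two terms appear together."""
    weights = defaultdict(int)
    for d in docs:
        for a, b in combinations(sorted(d["terms"]), 2):
            weights[(a, b)] += 1
    return dict(weights)

def via_entity(docs, entity_key):
    """Indirect relation: count intermediate entities (e.g. authors)
    whose documents together use both terms."""
    terms_by_entity = defaultdict(set)
    for d in docs:
        for entity in d[entity_key]:
            terms_by_entity[entity] |= d["terms"]
    weights = defaultdict(int)
    for terms in terms_by_entity.values():
        for a, b in combinations(sorted(terms), 2):
            weights[(a, b)] += 1
    return dict(weights)

basic = cooccurrence(docs)
by_author = via_entity(docs, "authors")
```

In this toy data the pair ("vaccine", "vector") never co-occurs in one document but is linked through author A, while ("antigen", "vector") gains extra weight because both authors use the two terms. The same pattern could be applied with references or citing documents as the intermediate entity.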
A fused relation can compensate for the limited information in any single association and thus yield more accurate topic associations. Acquiring multiple topic associations is the basis of multi-relation fusion, and whether the fusion algorithm strengthens meaningful semantic associations while weakening noisy ones is also important for multi-relation topic clustering. With reference to Morris's definition of association weights for multi-relational MEs, this study gives a method for calculating both direct and indirect association weights between MEs, and on that basis proposes a multi-relation extraction and relation fusion method for topic identification. Finally, taking genetically engineered vaccines as the experimental field, we extracted seven types of topic correlation matrices with a self-programmed relation matrix acquisition algorithm and fused them with the PathSelClus algorithm for topic clustering. A comparative analysis shows that multi-relation fusion can effectively improve topic clustering. The PathSelClus relation fusion used here is only one of several existing multi-relation fusion methods, and it depends heavily on expert knowledge: the quality of the annotations directly affects the clustering results, and there is no effective way to determine the number of clusters. Much work remains, such as evaluating how other fusion methods perform in topic identification from text and how their performance compares. In line with the research objective, we will also explore more fusion methods and integrate them to obtain more informative fusion results. 4 figs. 6 tabs. 19 refs.
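To make the idea of relation fusion concrete, here is a minimal sketch that combines several term-pair weight maps into one by a normalized weighted sum. This is a deliberately simple stand-in, not PathSelClus, which selects and weights relation paths with user guidance; the relation names, the example weights, and the fixed coefficients below are hypothetical.

```python
def normalize(weights):
    """Scale a pair -> weight map so that its largest weight is 1.0."""
    top = max(weights.values(), default=0)
    return {p: w / top for p, w in weights.items()} if top else {}

def fuse(relations, coefficients):
    """Weighted sum of normalized relation maps.

    The coefficients are fixed by hand here (an assumption); a method such
    as PathSelClus would instead derive relation weights during clustering.
    """
    fused = {}
    for name, weights in relations.items():
        for p, w in normalize(weights).items():
            fused[p] = fused.get(p, 0.0) + coefficients[name] * w
    return fused

# Hypothetical term-pair weights for two relation types.
relations = {
    "cooccurrence": {("antigen", "vaccine"): 1, ("antigen", "vector"): 1},
    "shared_author": {("antigen", "vector"): 2, ("vaccine", "vector"): 1},
}
coefficients = {"cooccurrence": 0.5, "shared_author": 0.5}
fused = fuse(relations, coefficients)
```

A pair supported by both relations ends up with the highest fused weight, while a pair seen in only one relation is kept at a lower weight, which is the intuition behind strengthening meaningful associations and damping noise.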
References
    [1] Xu Haiyun, Dong Kun, Wei Ling, et al. Research on multi-source data fusion method in Scientometrics[J]. Journal of the China Society for Scientific and Technical Information, 2018, 37(3): 318-328.
    [2] Xu H Y, Yue Z H, Wang C, et al. Multi-source data fusion study in scientometrics[J]. Scientometrics, 2017, 111(2): 773-792.
    [3] Janssens F, Zhang L, De Moor B. Hybrid clustering for validation and improvement of subject-classification schemes[J]. Information Processing & Management, 2009, 45(6): 683-702.
    [4] Xu Haiyun, Dong Kun, Liu Chunjiang, et al. A review on topic identification of scientific text files[J]. Information Science, 2017, 35(1): 153-160.
    [5] Van Den Besselaar P, Heimeriks G. Mapping research topics using word-reference co-occurrences: a method and an exploratory case study[J]. Scientometrics, 2006, 68(3): 377-393.
    [6] Wen B, Horlings E, van der Zouwen M. Mapping science through bibliometric triangulation: an experimental approach applied to water research[J]. Journal of the Association for Information Science & Technology, 2016.
    [7] Dong K, Xu H, Luo R. An integrated method for interdisciplinary topic identification and prediction: a case study on information science and library science[J]. Scientometrics, 2018,115(2):849-868.
    [8] Zhang Y, Shang L, Huang L, et al. A hybrid similarity measure method for patent portfolio analysis[J]. Journal of Informetrics, 2016, 10(4): 1108-1130.
    [9] Calero-Medina C, Noyons E C. Combining mapping and citation network analysis for a better understanding of the scientific development: the case of the absorptive capacity field[J]. Journal of Informetrics, 2008, 2(4): 272-279.
    [10] He X, Ding C H, Zha H. Automatic topic identification using webpage clustering[C]//Data Mining, 2001 ICDM 2001, Proceedings IEEE International Conference on, 2001: 195-202.
    [11] He X, Zha H, Ding C H. Web document clustering using hyperlink structures[J]. Computational Statistics & Data Analysis, 2002, 41(1): 19-45.
    [12] Wang Y, Kitsuregawa M. Evaluating contents-link coupled web page clustering for web search results[C]//Proceedings of the eleventh international conference on Information and knowledge management, 2002: 499-506.
    [13] Janssens F, Glänzel W, De Moor B. Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis[C]//Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, 2007: 360-369.
    [14] Janssens F A. Clustering of scientific fields by integrating text mining and Bibliometrics[D]. Belgium:Katholieke Universiteit Leuven, 2007.
    [15] Guo Hongmei, Kong Beibei, Zhang Zhixiong. Study on textual topic identification by clustering clique structure in multi-relationship text graph[J]. Journal of the China Society for Scientific and Technical Information, 2017, 36(5): 433-442.
    [16] Morris S A, Yen G G. Construction of bipartite and unipartite weighted networks from collections of journal papers[J]. Physics, 2005.
    [17] Morris S A. Unified mathematical treatment of complex cascaded bipartite networks: the case of collections of journal papers[D]. Stillwater: Oklahoma State University, 2005.
    [18] Sun Y, Norick B, Han J, et al. PathSelClus: integrating meta-path selection with user-guided object clustering in heterogeneous information networks[J]. ACM Transactions on Knowledge Discovery from Data (TKDD), 2013, 7(3): 11.
    [19] Derwent data analyzer[EB/OL].[2018-10-07]. https://www.thevantagepoint.com/tda-home.html.
