Name Authority for Research Entities: Research and Practice
  • English title: Constructing Name Authority for Research Entities
  • Authors: Zhang Jianyong (张建勇); Qian Li (钱力); Yu Qianqian (于倩倩); Dong Zhipeng (董智鹏); Huang Yongwen (黄永文); Liu Jianhua (刘建华); Guo Shu (郭舒); Wang Feng (王峰)
  • Affiliations: National Science Library, Chinese Academy of Sciences; Department of Library, Information and Archives Management, University of Chinese Academy of Sciences; Institute of Agricultural Information, Chinese Academy of Agricultural Sciences; Library of ShanghaiTech University; National Computer Network Emergency Response Technical Team/Coordination Center of China; Institute of Automation, Chinese Academy of Sciences
  • Keywords: Name Authority; Journal Authority; Institution Authority; Fund Authority; Author Authority
  • Chinese journal title (code): XDTQ
  • English journal title: Data Analysis and Knowledge Discovery
  • Publication date: 2019-01-25
  • Publisher: Data Analysis and Knowledge Discovery (数据分析与知识发现)
  • Year: 2019
  • Issue: v.3; No.25
  • Funding: This work is one of the research outcomes of the National Science and Technology Library (NSTL) funded project "Construction of Name Authority Databases" (project no. 科1817); the Young Talent Frontier Project of the National Science Library, Chinese Academy of Sciences, "Research on Name Authority Methods Based on Deep Learning" (project no. G180171001); and the Key Task Special Project of the National Science Library, Chinese Academy of Sciences, "Analysis of Researchers' Research Directions and Research Focus" (project no. 院1643).
  • Language: Chinese
  • Record ID: XDTQ201901005
  • Page count: 11
  • Issue number: 01
  • CN: 10-1478/G2
  • Pages: 31-41
Abstract
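[Objective] To build authority files for institutions, authors, journals, and funds, providing a data foundation for discovery systems and for the analysis and evaluation of research entities. [Methods] Starting from multi-source heterogeneous data, the data are aggregated and merged into unified structured data with unique identifiers. Research entities and the relationships among them are extracted according to a name authority metadata model. Different sets of disambiguation rules are defined according to the literature features available for each type of research entity, and text similarity is computed by combining traditional string matching with deep learning methods. [Results] The work produced an institution authority database with more than 2.6 million records, an author authority database with more than 23 million records, a journal authority database with more than 30,000 records, and a fund authority database with more than 2 million records. Taking the NSTL institution authority file as an example and comparing it with the InCites institution authority file, the average agreement for six selected universities in three countries (the United States, the United Kingdom, and China) reached 86.8%. [Limitations] The proposed disambiguation rules and algorithms need further refinement to handle the variety of ways in which literature features are expressed; the data of each specific source must be analyzed in order to choose a suitable algorithm or model. [Conclusions] This study proposes a method for aggregating and merging multi-source heterogeneous data and designs disambiguation rules and algorithms for research entities, which effectively support the standardization and comprehensiveness of name authority database construction.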
[Objective] This paper aims to construct name authority files for authors, institutions, journals, and funding. [Methods] First, we loaded, cleansed, transformed, integrated, and merged names from multiple sources to create uniform structured data with unique identifiers. Then, we used the metadata model for name authority to extract research entities and the relationships among them. Finally, we proposed disambiguation rules for the different research entities and computed text similarity with algorithms such as Levenshtein distance, Jaccard similarity, word2vec, and CNN. [Results] Our study created name authority databases for authors (23 million records), institutions (2.6 million records), journals (30,000 records), and funding (2 million records). We chose six institutions' names from NSTL and compared them with those from InCites; the average agreement reached 86.8%. [Limitations] The proposed disambiguation strategies and algorithms need to be further refined to deal with the diverse expressions of the selected disambiguation features. The data from each source must be analyzed in order to apply an appropriate algorithm. [Conclusions] The proposed method and disambiguation strategies could improve the performance and comprehensiveness of name authority databases.
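The abstract names Levenshtein distance and Jaccard similarity among the string-matching measures used for disambiguation. The following is a minimal sketch of those two measures applied to institution name variants, not the authors' implementation. The function names, the whitespace tokenization, the lowercasing, the rule of taking the larger of the two scores, and the 0.8 threshold are all illustrative assumptions; the word2vec and CNN components mentioned in the abstract are omitted.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (two-row dynamic programming)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string on the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Edit distance normalized to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def jaccard_similarity(a: str, b: str) -> float:
    """Jaccard index over lowercased whitespace tokens (an assumed tokenization)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def is_same_entity(a: str, b: str, threshold: float = 0.8) -> bool:
    """Hypothetical decision rule: accept if either measure reaches the threshold."""
    score = max(levenshtein_similarity(a.lower(), b.lower()),
                jaccard_similarity(a, b))
    return score >= threshold
```

Character-level edit distance tolerates small spelling differences, while token-level Jaccard tolerates reordered words, so the two measures catch different kinds of name variants. Abbreviations such as "Univ" for "University" score poorly on both, which is one reason a rule set per entity type and additional features would still be needed.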
