Research on a Paper Author Name Disambiguation Method Based on Machine Learning
  • English title: Research on author name disambiguation method based on machine learning
  • Authors: 邓可君; 华凯; 邓昌明; 姜宁; 袁玲; 彭一明; 张治坤
  • Authors (romanized): DENG Ke-Jun; HUA Kai; DENG Chang-Ming; JIANG Ning; YUAN Ling; PENG Yi-Ming; ZHANG Zhi-Kun (Computer Center, Peking University)
  • Keywords: author name disambiguation; machine learning; text feature extraction
  • Chinese journal title: SCDX
  • English journal title: Journal of Sichuan University (Natural Science Edition)
  • Institution: Computer Center, Peking University
  • Publication date: 2019-03-25 16:12
  • Publisher: Journal of Sichuan University (Natural Science Edition)
  • Year: 2019
  • Issue: v.56
  • Language: Chinese
  • Pages: SCDX201902010
  • Page count: 5
  • CN: 02
  • ISSN: 51-1595/N
  • Classification number: 59-63
Abstract
This paper proposes an automated method for disambiguating the author names of papers, based on rule matching and machine learning. Candidate authors are first determined using manually constructed name-matching rules. When a name has multiple candidates, features are extracted from the paper's attribute information (for example co-authors, title, abstract, keywords, and publication name), and a suitable machine learning algorithm is then applied to disambiguate. The experimental results show that the K-nearest neighbor and Softmax classifiers are better suited to the author name disambiguation task than the other models. In addition, extracting features from the author information separately from the paper's other information effectively improves disambiguation accuracy.
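The two-stage idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the name-matching rule, the bag-of-words counts, the cosine similarity, the way the two scores are added, and all sample data are assumptions made for illustration. The sketch only shows rule-based candidate selection followed by a K-nearest-neighbor vote, with co-author features scored separately from the other text.

```python
import math
from collections import Counter

def normalize(name):
    # Lowercase, drop periods and hyphens, collapse whitespace.
    return " ".join(name.lower().replace(".", " ").replace("-", "").split())

def name_variants(full_name):
    # Hypothetical rule: accept either name order, with the given name
    # written in full or as an initial. Assumes "Given Family" input.
    parts = normalize(full_name).split()
    given, family = parts[0], parts[-1]
    g = given[0]
    return {f"{given} {family}", f"{family} {given}", f"{g} {family}", f"{family} {g}"}

def candidates(byline, roster):
    # Stage 1: rule matching against a roster of known authors.
    key = normalize(byline)
    return [person for person in roster if key in name_variants(person)]

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def similarity(p, q):
    # Co-author features and text features are scored separately, then added.
    co = cosine(Counter(map(normalize, p["coauthors"])), Counter(map(normalize, q["coauthors"])))
    tx = cosine(Counter(p["text"].lower().split()), Counter(q["text"].lower().split()))
    return co + tx

def disambiguate(paper, byline, roster, labeled, k=3):
    cands = candidates(byline, roster)
    if len(cands) <= 1:
        return cands[0] if cands else None
    # Stage 2: K-nearest-neighbor vote over labeled papers of the candidates.
    pool = [(similarity(paper, q), owner) for owner, q in labeled if owner in cands]
    top = sorted(pool, key=lambda s: s[0], reverse=True)[:k]
    votes = Counter(owner for _, owner in top)
    return votes.most_common(1)[0][0] if votes else None

# Made-up sample data, for illustration only.
roster = ["Wei Wang", "Wen Wang", "Kai Hua"]
labeled = [
    ("Wei Wang", {"coauthors": ["Li Na", "Chen Bo"], "text": "graph neural network node embedding"}),
    ("Wei Wang", {"coauthors": ["Li Na"], "text": "network embedding for link prediction"}),
    ("Wen Wang", {"coauthors": ["Zhao Min"], "text": "protein folding molecular dynamics simulation"}),
    ("Wen Wang", {"coauthors": ["Zhao Min", "Sun Yu"], "text": "molecular simulation of membrane protein"}),
]
paper = {"coauthors": ["Zhao Min"], "text": "dynamics of protein membrane simulation"}

print(disambiguate(paper, "W. Wang", roster, labeled))
```

Here the byline "W. Wang" matches two roster entries, so the vote among the nearest labeled papers decides. A byline that matches a single roster entry is resolved by the rule alone, without the learning step.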
