A Review of Methods for Extracting Academic Information from Scientific Papers (科技论文中学术信息的提取方法综述)
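  • English title: A Method Review on Academic Information Extracting from Scientific Papers
  • Authors: HU Zhigang (胡志刚); TIAN Wencan (田文灿); SUN Taian (孙太安); HOU Haiyan (侯海燕)
  • Authors (English): HU ZhiGang; TIAN WenCan; SUN TaiAn; HOU HaiYan (Institute of Science of Science and Science and Technology Management, Dalian University of Technology; WISE Laboratory, Dalian University of Technology)
  • Keywords: academic information; full text of papers; information extraction; machine learning
  • Keywords (English): Academic Information; Full Text; Information Extraction; Machine Learning
  • Journal title (Chinese): SZTG
  • Journal title (English): Digital Library Forum
  • Affiliations: Institute of Science of Science and Science and Technology Management, Dalian University of Technology; WISE Laboratory, Dalian University of Technology
  • Publication date: 2017-10-25
  • Publisher: Digital Library Forum (数字图书馆论坛)
  • Year: 2017
  • Issue: No.161
  • Funding: Supported by the National Natural Science Foundation of China project "Research on Methods and Applications of Full-Text Citation Analysis in the Context of Open Access" (No. 71503031)
  • Language: Chinese
  • Page: SZTG201710010
  • Number of pages: 9
  • CN: 10
  • ISSN: 11-5359/G2
  • Classification number: 41-49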
Abstract
Identifying and extracting the academic information contained in scholarly papers has become a pressing practical need for making better use of, and mining, their text, with broad application prospects in text mining, information retrieval, topic monitoring, informetrics, and related fields. Academic information can be divided into five categories: bibliographic information, section information, reference information, in-text citation information, and other information. This paper reviews the main methods for extracting each category from the full text of academic papers in two formats, PDF and HTML/XML, and indicates which text format each method mainly targets and which kinds of information it can extract. Finally, the paper lists commonly used tools for extracting academic information.
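As a rough illustration of what extraction from structured full text can look like, the sketch below pulls four of the categories named in the abstract (bibliographic, section, reference, and in-text citation information) out of a small XML document. It is not the method of the reviewed paper: the XML layout, the element names, and the `extract_academic_info` helper are all hypothetical and chosen only for the example. Real publisher schemas differ, and PDF input would need a layout-analysis step first.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified article markup (assumed for illustration only).
SAMPLE = """<article>
  <front>
    <article-title>An Example Paper</article-title>
    <contrib>A. Author</contrib>
    <contrib>B. Author</contrib>
  </front>
  <body>
    <sec><title>Introduction</title>
      <p>Prior work <xref rid="r1">[1]</xref> showed this.</p></sec>
    <sec><title>Methods</title>
      <p>We follow <xref rid="r2">[2]</xref> and <xref rid="r1">[1]</xref>.</p></sec>
  </body>
  <back>
    <ref id="r1">Smith J. First cited work. 2010.</ref>
    <ref id="r2">Lee K. Second cited work. 2012.</ref>
  </back>
</article>"""


def extract_academic_info(xml_text):
    """Collect bibliographic, section, reference, and citation info."""
    root = ET.fromstring(xml_text)
    info = {
        # Bibliographic information: title and author names.
        "bibliographic": {
            "title": root.findtext("front/article-title"),
            "authors": [c.text for c in root.findall("front/contrib")],
        },
        # Section information: headings in document order.
        "sections": [s.findtext("title") for s in root.findall("body/sec")],
        # Reference information: reference-list entries keyed by id.
        "references": {r.get("id"): r.text for r in root.findall("back/ref")},
        # In-text citation information: (section heading, cited reference id).
        "citations": [],
    }
    for sec in root.findall("body/sec"):
        for xref in sec.iter("xref"):
            info["citations"].append((sec.findtext("title"), xref.get("rid")))
    return info


info = extract_academic_info(SAMPLE)
```

Recording the section heading with each in-text citation keeps the link between where a work is cited and which reference-list entry it points to. The fifth category, other information, is left out of this sketch.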
