Structural Recognition of PDF Academic Literature Based on Computer Vision

Original title: 基于机器视觉的PDF学术文献结构识别
Authors: Yu Fengchang (于丰畅); Lu Wei (陆伟)
Keywords: Portable Document Format (PDF); academic literature; computer vision; structural recognition
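Journal (Chinese): 情报学报 (journal code QBXB)
Journal (English): Journal of the China Society for Scientific and Technical Information
Institution: School of Information Management, Wuhan University
Publication date: 2019-04-24
Publisher: 情报学报
Year: 2019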
Volume: 38
Language: Chinese
Record ID: QBXB201904006
Page count: 7
Issue: 04
CN: 11-2257/G3
Pages: 54-60

Abstract

The Portable Document Format (PDF) holds a dominant position in the publication and distribution of electronic academic literature. However, because of its complex technical specification, PDF content cannot be read directly by machines, which creates considerable inconvenience for research that works on academic literature. This paper proposes a computer-vision-based method for recognizing the structure of PDF documents. Targeting common PDF academic papers, the method maps the visual objects in a PDF file to its text objects, obtains the geometric and textual attributes of each content object, and then applies heuristic rules to determine each object's type, yielding both the physical and the logical structure of the document. In an intuitive way, the method overcomes the drawbacks of other PDF parsing approaches, which need extensive manual feature engineering or large-scale corpus training and have difficulty recognizing formulas and tables. It was successfully applied to structure recognition and full-text extraction on the proceedings of the ACL (Association for Computational Linguistics).
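The abstract gives no implementation details, so the following is only a minimal Python sketch of the general idea: text spans from the PDF text layer are attached to visually detected regions by bounding-box overlap, and each region is then labeled by a simple heuristic. The class names (`Box`, `TextSpan`, `Block`), the functions `map_spans_to_blocks` and `classify`, the overlap threshold, and the font-size rule are all hypothetical choices made for illustration and are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Box:
    """Axis-aligned bounding box in page coordinates."""
    x0: float
    y0: float
    x1: float
    y1: float

    def area(self) -> float:
        return max(0.0, self.x1 - self.x0) * max(0.0, self.y1 - self.y0)

    def intersection(self, other: "Box") -> float:
        w = min(self.x1, other.x1) - max(self.x0, other.x0)
        h = min(self.y1, other.y1) - max(self.y0, other.y0)
        return max(0.0, w) * max(0.0, h)


@dataclass
class TextSpan:
    """A piece of text from the PDF text layer (hypothetical structure)."""
    box: Box
    text: str
    font_size: float


@dataclass
class Block:
    """A region found on the rendered page image (hypothetical structure)."""
    box: Box
    spans: list = field(default_factory=list)


def map_spans_to_blocks(blocks, spans, min_overlap=0.5):
    """Attach each text span to the visual block that covers most of its area."""
    for span in spans:
        area = span.box.area()
        best, best_ratio = None, 0.0
        for block in blocks:
            ratio = block.box.intersection(span.box) / area if area else 0.0
            if ratio > best_ratio:
                best, best_ratio = block, ratio
        if best is not None and best_ratio >= min_overlap:
            best.spans.append(span)
    return blocks


def classify(block, body_font_size=10.0):
    """Toy heuristic; the thresholds are assumptions, not the paper's rules."""
    if not block.spans:
        return "figure"  # visual region with no text underneath
    text = " ".join(s.text for s in block.spans)
    avg_size = sum(s.font_size for s in block.spans) / len(block.spans)
    if avg_size >= 1.2 * body_font_size and len(text) < 80:
        return "heading"
    return "paragraph"
```

In this sketch the geometric attributes come from the visual block and the textual attributes from the spans mapped into it, so the labeling step can combine both. A real system would need further rules or a detector for formulas and tables, which this toy heuristic does not attempt.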

