A Document Structure Reconstruction Method Combining Syntax Analysis and Error Correction

English title: A method combining syntax analysis and correction rules to re-construct streaming document
Authors: 张真 (ZHANG Zhen); 李宁 (LI Ning); 田英爱 (TIAN Ying'ai); 耿思 (GENG Si); 许洁 (XU Jie)
Author affiliations: Computer School, Beijing Information Science & Technology University; China Electronics Standardization Institute
Keywords: streaming document; document structure reconstruction; fault tolerance; left-corner analysis method; error correction rules
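Chinese journal title (code): BJGY
English journal title: Journal of Beijing Information Science & Technology University
Affiliations: Computer School, Beijing Information Science & Technology University; China Electronics Standardization Institute
Publication date: 2019-04-15
Publisher: Journal of Beijing Information Science & Technology University (Natural Science Edition)
Year: 2019
Issue: v.34; No.128
Funding: National Natural Science Foundation of China (61672105); National High Technology Research and Development Program of China (863 Program) (2015AA015403); National Key Research and Development Program of China (2018YFB1004100)
Language: Chinese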
File name: BJGY201902007
Page count: 6
Issue (within year): 02
CN: 11-5866/N
Pages: 32-37

Abstract

To address the shortcomings of traditional methods when processing streaming documents with irregular structure, this paper proposes a new streaming document structure reconstruction method that combines left-corner analysis with error correction rules. XML Schema is used to construct a syntax tree of the typesetting rules for the document's logical components. Guided by this syntax tree, the left-corner analysis method analyzes the logical components of the document to reconstruct its structure. During the analysis, correction rules are applied to detect and correct errors in the document, so that reconstruction can proceed and the most likely document structure is obtained. Experimental results show that the proposed method outperforms other algorithms in both fault tolerance and recognition accuracy during streaming document structure reconstruction, which lays a foundation for document understanding and format checking.
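
The combination described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy grammar `GRAMMAR` stands in for the typesetting rules that the paper derives from an XML Schema, the component labels (`title`, `heading`, `para`) are hypothetical, and the single-relabel table `CORRECTIONS` is an assumed, much simpler substitute for the paper's correction rules. The recognizer itself is a standard left-corner recognizer: it starts bottom-up from the next component and uses a left-corner table to filter rules top-down.

```python
from typing import Dict, Iterator, List, Optional, Set, Tuple

Rule = Tuple[str, Tuple[str, ...]]

# Hypothetical typesetting rules (the paper builds its rules from an XML Schema).
GRAMMAR: List[Rule] = [
    ("Doc", ("title", "Sections")),
    ("Sections", ("Section",)),
    ("Sections", ("Section", "Sections")),
    ("Section", ("heading", "Paras")),
    ("Paras", ("para",)),
    ("Paras", ("para", "Paras")),
]

# Hypothetical correction rules: labels a component may be relabelled to.
CORRECTIONS: Dict[str, List[str]] = {"para": ["heading"], "heading": ["para"]}


def left_corner_table(grammar: List[Rule]) -> Dict[str, Set[str]]:
    """For each symbol, collect every symbol that can begin it (reflexive, transitive)."""
    symbols = {lhs for lhs, _ in grammar} | {s for _, rhs in grammar for s in rhs}
    table = {s: {s} for s in symbols}
    changed = True
    while changed:
        changed = False
        for lhs, rhs in grammar:
            for goal in symbols:
                if lhs in table[goal] and rhs[0] not in table[goal]:
                    table[goal].add(rhs[0])
                    changed = True
    return table


LC = left_corner_table(GRAMMAR)


def parse(goal: str, tokens: List[str], i: int) -> Iterator[int]:
    """Yield each end position j such that tokens[i:j] can be analysed as `goal`."""
    # Top-down filter: the next component must be a possible left corner of the goal.
    if i < len(tokens) and tokens[i] in LC[goal]:
        yield from complete(tokens[i], goal, tokens, i + 1)


def complete(found: str, goal: str, tokens: List[str], i: int) -> Iterator[int]:
    """`found` was recognised ending at i; project it upward toward `goal`."""
    if found == goal:
        yield i
    for lhs, rhs in GRAMMAR:
        # Use only rules whose left corner is `found` and whose head can start `goal`.
        if rhs[0] == found and lhs in LC[goal]:
            for j in parse_sequence(rhs[1:], tokens, i):
                yield from complete(lhs, goal, tokens, j)


def parse_sequence(symbols: Tuple[str, ...], tokens: List[str], i: int) -> Iterator[int]:
    """Recognise the remaining right-hand-side symbols one after another."""
    if not symbols:
        yield i
        return
    for j in parse(symbols[0], tokens, i):
        yield from parse_sequence(symbols[1:], tokens, j)


def accepts(tokens: List[str]) -> bool:
    """True if the whole component sequence forms a valid document."""
    return any(j == len(tokens) for j in parse("Doc", tokens, 0))


def repair(tokens: List[str]) -> Optional[List[str]]:
    """Return the sequence itself, or the first single relabelling that parses."""
    if accepts(tokens):
        return tokens
    for pos, label in enumerate(tokens):
        for alternative in CORRECTIONS.get(label, []):
            candidate = tokens[:pos] + [alternative] + tokens[pos + 1:]
            if accepts(candidate):
                return candidate
    return None
```

In this sketch, `repair(["title", "para", "para"])` relabels the first paragraph as a heading so that the sequence parses. A fuller treatment would rank competing corrections and handle missing or redundant components, which this sketch does not attempt.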

