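Research on Key Technologies of Chinese Layout Analysis

English title: The Key Technology Research on Chinese Layout Analysis
Author: Jin Cong (靳从)
Degree level: Doctoral
Discipline: Pattern Recognition and Intelligent Systems
Keywords (translated from Chinese): document image processing ; layout analysis ; skew detection ; table recognition
English keywords: document image processing ; layout analysis ; skew detection ; form recognition
Degree year: 2007
Advisor: Yang Jingyu (杨静宇)
Discipline code: 081104
Degree-granting institution: Nanjing University of Science and Technology (南京理工大学)
Submission date: 2007-05-01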

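Abstract

Layout analysis is an important component of a layout information processing system. It aims to convert the content of paper documents into electronic information so that the layout can then be digitized through layout understanding. The correctness of layout analysis directly affects the result of layout understanding, and in turn determines whether the semantic and logical relations in the system's output are correct. Among the various kinds of document layouts, Chinese layouts, with their diverse typesetting forms and the many strokes of Chinese characters, are far more complex to analyze than Western-language layouts, and this has become a bottleneck for current layout analysis technology. Research on Chinese layout analysis therefore has both theoretical significance and practical value.
     The main task of layout analysis is to analyze the geometric structure of the layout. Because layouts are complex, layout analysis covers a very broad range of topics. Different types of layout convey different information and call for different processing methods. This dissertation studies in depth several key technologies involved in Chinese layout analysis, mainly layout skew detection, layout region segmentation and recognition, determination of layout object order, and table recognition. The innovative results are mainly reflected in the following aspects: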
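     1. Layout skew detection algorithm based on window transform
     When a layout is scanned, some skew inevitably occurs and affects subsequent processing. To detect and correct layout skew, the algorithm first selects a suitable window, applies variable-resolution processing to the detail within the window, and extracts relevant feature points for line fitting, thereby detecting the skew angle of the layout. Experimental results show that the method detects the skew angle of various kinds of layouts quickly and accurately, and adapts well.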
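     2. Layout skew detection algorithm based on layout edge enhancement
     Considering the effect of layout complexity on the efficiency of window selection, this dissertation also proposes a layout skew detection algorithm based on layout edge enhancement. The algorithm first processes the skewed image with an operator to obtain an image block whose boundary information represents the boundary information of the original layout well. It then represents the boundary of this image block with a 4-direction chain code and extracts approximately straight line information from the block. Finally, it fits a line with the least-squares algorithm and computes the skew angle of the layout. Experimental results show that the algorithm is accurate, fast, and independent of image content.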
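The last two steps of this algorithm (chain-code boundary representation and least-squares line fitting) can be illustrated with a minimal Python sketch. It is not the dissertation's implementation: the direction convention in `STEPS`, the function names `chain_to_points` and `skew_angle_degrees`, and the choice to fit all decoded boundary points are assumptions made for illustration, and the edge-enhancement operator itself is not specified in the abstract and is not modeled here.

```python
import math

# Assumed 4-direction convention: 0 = +x, 1 = +y, 2 = -x, 3 = -y.
STEPS = {0: (1, 0), 1: (0, 1), 2: (-1, 0), 3: (0, -1)}


def chain_to_points(start, codes):
    """Decode a 4-direction chain code into the boundary points it visits."""
    x, y = start
    points = [(x, y)]
    for code in codes:
        dx, dy = STEPS[code]
        x, y = x + dx, y + dy
        points.append((x, y))
    return points


def skew_angle_degrees(points):
    """Fit y = a*x + b by least squares and return atan(a) in degrees."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    if sxx == 0:
        raise ValueError("all points share one x value; slope is undefined")
    return math.degrees(math.atan(sxy / sxx))


# A nearly horizontal edge drawn as a staircase: 10 steps along x, 1 along y.
edge = chain_to_points((0, 0), ([0] * 10 + [1]) * 3)
angle = skew_angle_degrees(edge)
```

Because a chain code only moves along the axes, a slightly skewed edge appears as a staircase; fitting a line through all of its points averages out the steps and recovers the overall slope.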
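     3. Layout segmentation and recognition based on hierarchical extraction
     Layout segmentation and region recognition divide the layout spatially into regions containing different data types. The algorithm first divides the layout into several layers such as image, chart, and text. It extracts the main line segments in the image layer and the chart layer, then analyzes the text layer with the connected-region method, and distinguishes graphics, tables, and text through text "blurring", edge detection, paragraph extraction, and a test of projection periodicity. The algorithm thus combines layout segmentation with region recognition, which improves its efficiency.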
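To make the idea of a projection periodicity test concrete, here is a small Python sketch. The abstract does not say how periodicity is measured, so the autocorrelation approach and the names `horizontal_projection` and `dominant_period` are assumptions chosen for illustration. The intuition is that a text block produces a row projection that repeats at the line pitch, whereas a picture usually does not.

```python
def horizontal_projection(image):
    """Count foreground (1) pixels in each row of a binary image."""
    return [sum(row) for row in image]


def dominant_period(profile, min_lag=2):
    """Return the lag with the largest positive autocorrelation, or None.

    The profile is mean-centered first so that a constant profile
    (blank or uniformly filled region) yields no period at all.
    """
    n = len(profile)
    mean = sum(profile) / n
    centered = [value - mean for value in profile]
    best_lag, best_score = None, 0.0
    for lag in range(min_lag, n // 2 + 1):
        score = sum(centered[i] * centered[i + lag] for i in range(n - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


# Synthetic "text block": 3 inked rows then 2 blank rows, repeated 6 times.
ink, blank = [1] * 10, [0] * 10
text_block = ([ink] * 3 + [blank] * 2) * 6
period = dominant_period(horizontal_projection(text_block))
```

In this synthetic block the detected period equals the line pitch of 5 rows. A real classifier would also need a threshold on the strength of the peak, which is left out of this sketch.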
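     4. Determination of layout object order based on a directed graph
     The algorithm uses the spatial structure of the layout objects to build a spatial-structure directed graph, which turns the problem of ordering layout objects into a traversal search over the directed graph. Traversing the graph produces a traversal tree that determines the order of the layout objects. Experimental results show that the algorithm is effective.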
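The general idea of ordering layout objects through a directed graph can be sketched as follows. This is an illustration under stated assumptions, not the dissertation's procedure: the `Block` type, the `precedes` rule (above with horizontal overlap, or to the left with vertical overlap, suited to horizontally typeset pages), and the use of a topological sort in place of the traversal tree described in the abstract are all hypothetical choices.

```python
import heapq
from collections import namedtuple

# Bounding box in image coordinates: smaller `top` means higher on the page.
Block = namedtuple("Block", "name left top right bottom")


def precedes(a, b):
    """Assumed rule: a is read before b if it lies above b with horizontal
    overlap, or to the left of b with vertical overlap."""
    h_overlap = a.left < b.right and b.left < a.right
    v_overlap = a.top < b.bottom and b.top < a.bottom
    return (a.bottom <= b.top and h_overlap) or (a.right <= b.left and v_overlap)


def reading_order(blocks):
    """Build the precedence graph and emit block names in topological order."""
    by_name = {b.name: b for b in blocks}
    successors = {b.name: [] for b in blocks}
    indegree = {b.name: 0 for b in blocks}
    for a in blocks:
        for b in blocks:
            if a is not b and precedes(a, b):
                successors[a.name].append(b.name)
                indegree[b.name] += 1
    # Among blocks with no pending predecessor, prefer top-most, then left-most.
    ready = [(b.top, b.left, b.name) for b in blocks if indegree[b.name] == 0]
    heapq.heapify(ready)
    order = []
    while ready:
        _, _, name = heapq.heappop(ready)
        order.append(name)
        for nxt in successors[name]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                blk = by_name[nxt]
                heapq.heappush(ready, (blk.top, blk.left, nxt))
    if len(order) != len(blocks):
        raise ValueError("precedence graph contains a cycle")
    return order


page = [
    Block("col2", 52, 12, 100, 100),
    Block("col1_para2", 0, 52, 48, 100),
    Block("title", 0, 0, 100, 10),
    Block("col1_para1", 0, 12, 48, 50),
]
order = reading_order(page)
```

For this two-column page the title comes first, then both paragraphs of the left column, then the right column. Vertically typeset Chinese text would need a different `precedes` rule, which is one reason the rule is kept separate from the traversal.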
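     5. Table recognition method based on an object-oriented directed graph model
     The algorithm first extracts the features and attributes of each object in a blank table and builds the corresponding table model. It then extracts features from the table to be recognized and applies two-level matching, making full use of the matching similarity of feature lines and related feature lines between the table and the model, combined with logical relations, to determine the table type. This achieves table recognition and improves its accuracy. Experimental results show that the method is efficient and flexible.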
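     Finally, this dissertation builds an experimental system for bill layout analysis and, on this system, runs experiments on the proposed algorithms for layout skew detection, layout segmentation and recognition, determination of layout object order, and table recognition. The experimental results show that the proposed methods work well in practical bill layout analysis and are general.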
Layout analysis is an important part of document layout analysis and understanding. It is used to convert the content of paper documents into electronic digital information for further digitization of the whole layout. Among the different kinds of document layouts, Chinese document layouts have diverse composition and complicated Chinese characters. This makes Chinese document layouts more difficult to analyze than layouts in alphabetic languages, and it has become a bottleneck in the current development of layout analysis technology. Thus, the study of Chinese layout analysis is of theoretical significance and application value.
     Because of the complexity of layouts, the scope of layout analysis is extremely wide. Different kinds of layout convey different information, which requires different processing methods in layout analysis. A number of key technologies of Chinese layout analysis are studied and presented in this dissertation: skew detection and correction, block segmentation and recognition, determination of logical order in the layout, and table recognition. The innovative achievements of this research are as follows:
     1. Layout skew detection algorithm based on window transform
     A scanned layout inevitably has some skew, which adversely affects follow-up processing. In this algorithm, a proper window is selected for skew detection and correction. Skew detection is achieved by applying variable-resolution processing to the detailed content in the window and fitting a line to the extracted feature points. Experimental results show that the algorithm adapts well and detects the skew of different layouts rapidly and accurately.
     2. Layout skew detection algorithm based on edge enhancement
     Considering the influence of layout complexity on the efficiency of window selection, another layout skew detection algorithm, based on edge enhancement, is put forward. In this algorithm, an image block is obtained by processing the image with an operator, and the edge information of the original layout is represented by that of the image block. A 4-direction chain code is used to represent the edge of this image block, from which approximate line information is extracted. Finally, the skew angle is calculated by least-squares fitting. Experimental results show that the algorithm is accurate, fast, and independent of the layout content.
     3. Layout segmentation and block recognition algorithm based on hierarchy extraction
     Layout segmentation and block recognition divide the layout into different geometrical zones and generate blocks with different types of data. First, the layout is separated into image, figure, and text levels. The main line segments are extracted from the image level and the figure level by mathematical morphology, and the text level is analyzed by connectivity. Figures, tables, and text are discriminated by text blurring, edge detection, paragraph extraction, and projection periodicity estimation. The algorithm combines layout segmentation with block recognition, which improves processing efficiency.
     4. Determination of logical order in layout based on a directed graph
     A spatial-structure directed graph is built by analyzing the spatial structure of the layout objects. This turns the determination of the logical order of layout objects into a traversal search in the directed graph, from which the logical order is determined. The effectiveness of this method was confirmed by experiments.
     5. Table recognition algorithm based on a directed graph
     A table model is established by extracting the features and attributes of an empty table. Features are then extracted from the table to be recognized. Table recognition is achieved through logical relationships and two-stage matching, which makes use of the matching similarity of feature lines between the model and the table to be recognized. The recognition accuracy is thereby improved. Experimental results show that the algorithm is flexible and efficient.
     Finally, an experimental system for bill layout analysis is established to validate the above algorithms: skew detection and correction, layout segmentation and block recognition, determination of logical order in the layout, and table recognition. Experimental results show that these algorithms are effective and general in analyzing bill images.

