Research of Character Segmentation Technology in Web Images (网页图像中字符分割技术的研究)

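English title: Research of Character Segmentation Technology in Web Images
Author: Peng Xiang (彭翔)
Degree level: Master's
Discipline: Computer Application Technology
Chinese keywords: 字符分割 ; 字符检测 ; 字符提取 ; 网页图像 ; 直方图
English keywords: Character Segmentation ; Character Detection ; Character Extraction ; Web Images ; Histogram
Degree year: 2008
Advisor: Liu Fang (刘芳)
Discipline code: 081203
Degree-granting institution: Huazhong University of Science and Technology (华中科技大学)
Thesis submission date: 2008-06-01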

Abstract

More and more images are being added to formerly text-only web pages on the Internet. These images contain a large amount of character information, which traditional text-based search engines can use to index and retrieve pages and which can also help multimedia search engines retrieve images. To enhance visual appeal, characters in web images often come in a rich variety of colors, languages, and fonts, are laid out in varied ways, and tend to be small. It is therefore necessary to study character segmentation techniques suited to web images, building on existing image character segmentation techniques and targeting the features above.
     Character segmentation is usually divided into two steps: character region detection and character component extraction. A character region detection algorithm locates the character regions in an image. For this step, an edge-feature-based detection algorithm is designed and implemented. Algorithms of this kind are robust to variations in character size, color, and language, and they run fast.
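The abstract does not spell out the edge features used for detection, so the following is only a minimal sketch of the general idea, under stated assumptions: characters produce many sharp horizontal gray-level jumps, so rows with a high density of such jumps are treated as candidate text rows. The function names `edge_map` and `text_row_bands` and the `threshold` and `min_density` values are hypothetical and are not taken from the thesis.

```python
def edge_map(gray, threshold=40):
    """Mark pixels whose gray level jumps sharply from the left neighbor."""
    edges = []
    for row in gray:
        marks = [0]  # the first pixel in a row has no left neighbor
        for x in range(1, len(row)):
            marks.append(1 if abs(row[x] - row[x - 1]) > threshold else 0)
        edges.append(marks)
    return edges


def text_row_bands(edges, min_density=0.2):
    """Return inclusive (top, bottom) row ranges whose edge density is high."""
    bands, start = [], None
    for y, row in enumerate(edges):
        dense = sum(row) / len(row) >= min_density
        if dense and start is None:
            start = y
        elif not dense and start is not None:
            bands.append((start, y - 1))
            start = None
    if start is not None:
        bands.append((start, len(edges) - 1))
    return bands


# Synthetic 8x10 image (assumed data): light background, two striped "text" rows.
image = [[200] * 10 for _ in range(8)]
image[3] = [0, 200] * 5
image[4] = [0, 200] * 5
bands = text_row_bands(edge_map(image))
```

Because the sketch looks only at gray-level differences, it does not depend on the absolute color of the characters or on the script, which is the kind of robustness the paragraph above attributes to edge-based detection. A real detector would also bound the regions horizontally and filter them by shape.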
     Existing character extraction algorithms generally rely on binarization. When a character region contains components of several colors (gray levels), a more reasonable approach is to divide the region into multiple components according to their color (gray-level) features. Because histogram segmentation can partition the gray-level space of an image, a character extraction algorithm based on histogram segmentation is presented. The algorithm uses the changes in distribution revealed by the difference histogram to locate segmentation points accurately and, combined with some prior knowledge, can effectively separate the character components from the image.
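One way to read the difference-histogram idea is sketched below. It assumes that a segmentation point is a valley of the gray-level histogram, found where the first difference changes from negative to positive, and that the split is placed in the middle of the valley floor. The thesis may define its segmentation points differently; `gray_histogram`, `split_points`, and `component_of` are hypothetical names used only for illustration.

```python
from bisect import bisect_left


def gray_histogram(pixels, levels=256):
    """Count how many pixels fall on each gray level."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    return hist


def split_points(hist):
    """Find valleys: the difference histogram goes negative, then positive."""
    diff = [hist[i + 1] - hist[i] for i in range(len(hist) - 1)]
    points = []
    floor_start = None  # first gray level of the current valley floor
    for i, d in enumerate(diff):
        if d < 0:
            floor_start = i + 1
        elif d > 0:
            if floor_start is not None:
                # Place the split in the middle of the flat valley floor.
                points.append((floor_start + i) // 2)
            floor_start = None
    return points


def component_of(gray_level, points):
    """Index of the gray-level interval that a pixel value falls into."""
    return bisect_left(points, gray_level)


# Assumed toy data: a dark component near level 10, a light one near 200.
pixels = [10] * 50 + [11] * 30 + [200] * 40 + [201] * 20
points = split_points(gray_histogram(pixels))
labels = [component_of(p, points) for p in pixels]
```

Unlike a single binarization threshold, this yields one split per valley, so a region with three or more gray-level groups is divided into as many components. Real histograms are noisy, so a smoothing pass before differencing would likely be needed; it is omitted here to keep the sketch short.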
     When a detected character region contains non-character components whose color (gray-level) features resemble those of the characters, the histogram-segmentation-based extraction algorithm produces poor results. Taking spatial position information into account effectively solves this problem. A character extraction algorithm based on the clustering algorithm DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is therefore presented. It treats character extraction as clustering pixels that are similar in color (gray level) and densely distributed; the pixels in one cluster form one component of the image, and decision rules are then applied to identify the character components.
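The clustering step described above can be sketched with a small pure-Python DBSCAN over `(x, y, gray)` pixel tuples. The neighborhood definition used here (a square spatial window of radius `eps_space` combined with a gray-level tolerance `eps_gray`) and all parameter values are assumptions made for illustration; the abstract does not state the distance measure or parameters used in the thesis.

```python
NOISE = -1


def region_query(points, i, eps_space, eps_gray):
    """Indices of pixels close to pixel i in both position and gray level."""
    x, y, g = points[i]
    return [j for j, (u, v, h) in enumerate(points)
            if abs(u - x) <= eps_space and abs(v - y) <= eps_space
            and abs(h - g) <= eps_gray]


def dbscan(points, eps_space=1, eps_gray=20, min_pts=4):
    """Label each pixel with a cluster id, or NOISE if it is isolated."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps_space, eps_gray)
        if len(neighbors) < min_pts:
            labels[i] = NOISE
            continue
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border pixel reached from a core pixel
            if labels[j] is not None:
                continue
            labels[j] = cluster
            more = region_query(points, j, eps_space, eps_gray)
            if len(more) >= min_pts:  # only core pixels extend the cluster
                queue.extend(more)
        cluster += 1
    return labels


# Assumed toy image: two dark strokes of similar gray, one column apart.
image = [[250] * 7] + [[250, 10, 10, 250, 12, 12, 250] for _ in range(3)] + [[250] * 7]
width = len(image[0])
points = [(x, y, image[y][x]) for y in range(len(image)) for x in range(width)]
labels = dbscan(points)


def label_at(x, y):
    return labels[y * width + x]
```

In this toy image the two strokes have nearly identical gray levels, so a histogram alone would put them in the same component, while the spatial term keeps them in separate clusters. The linear scan in `region_query` makes the sketch quadratic in the number of pixels, which is consistent with the remark in the next paragraph that the DBSCAN-based approach costs more than histogram segmentation.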
     Compared with the DBSCAN-based extraction algorithm, the histogram-segmentation-based algorithm has lower time complexity. To improve the overall efficiency of character segmentation, simple rules are applied to the detection results: larger regions, which are more likely to contain non-character components similar in color (gray level) to the characters, are passed to the DBSCAN-based algorithm, while all other regions are passed to the histogram-segmentation-based algorithm.
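The dispatch rule in this paragraph can be illustrated as follows. The abstract only says that the rule is simple and depends on region size, so the area-based test, the `max_small_area` value, and the name `choose_extractor` are all hypothetical.

```python
def choose_extractor(width, height, max_small_area=2000):
    """Pick an extraction method for a detected region from its size alone.

    The area threshold is an assumed placeholder, not a value from the thesis.
    """
    if width * height > max_small_area:
        return "dbscan"      # large region: more likely to hold look-alike clutter
    return "histogram"       # small region: the cheaper method is usually enough


# Assumed example regions as (width, height) in pixels.
regions = [(120, 40), (60, 20)]
choices = [choose_extractor(w, h) for w, h in regions]
```

The point of the design is that the expensive clustering step runs only where the cheaper histogram method is most likely to fail, so the average cost per region stays close to that of histogram segmentation.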
     The experiments analyze the character detection algorithm, the histogram-segmentation-based extraction algorithm, the DBSCAN-based extraction algorithm, and the hybrid extraction algorithm.

