Research on Formal Description Methods for Chinese Character Glyphs and Their Applications
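Abstract
In Chinese character information processing, existing methods for the formal description of character glyphs are mostly based on the structural analysis used in philology and in Chinese-language teaching research to describe the form and structure of characters. They describe a glyph hierarchically in terms of units of human cognition such as structure types, components, and strokes. These methods suffer from ambiguity and gaps in their decomposition rules, structure-type classification, and choice of descriptive primitives. They cannot describe all kinds of characters (including miswritten characters, variant forms in ancient texts, and folk composite characters) in a uniform way, nor can they support automatic glyph comparison. As a result they cannot serve applications built on glyph comparison, such as describing miswritten characters and quantifying errors in teaching research, describing and comparing glyphs in ancient texts, or retrieving rare glyphs in digital books.
     Character recognition models based on statistical machine learning cannot classify special characters such as miswritten, variant, and composite characters, because samples of these cannot be collected in advance and there is nothing to train on. For ordinary characters, where training samples are available, the statistical glyph features used by recognition models are hard to interpret logically and cannot be mapped to the structure types, components, and strokes that people perceive. Such a model is a "black-box" glyph description and cannot support human-oriented glyph comparison and analysis.
     These problems come down to the lack of a unified, effective method for formally describing Chinese character glyphs and for comparing them computationally. This paper addresses that core problem: oriented toward glyph comparison and analysis applications, it establishes a glyph description method together with a set of glyph comparison algorithms and practical tools. The main innovations are: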
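     1) A stroke-segment-mesh method for the formal description of Chinese character glyphs is proposed. It uses straight line segments of predefined length and direction, called stroke segments, as the primitives for describing a glyph. The primitives are of suitable granularity, standardized, and unambiguous, and can uniformly describe the similarities and differences among the glyph skeletons of all possible modern-script characters (including miswritten, variant, and composite characters). Validation experiments show that, compared with a dot-matrix glyph with the same number of primitives, this method needs fewer effective primitives to describe the same character and makes glyph comparison more efficient. The descriptions of different characters are also well separated, which improves the accuracy and reliability of glyph comparison and gives a good performance-to-cost ratio.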
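     2) Based on the stroke-segment-mesh description, a set of glyph comparison algorithms is proposed. The first, a stroke-segment-context comparison algorithm, uses the stroke segment as the unit of comparison. Tests on the GB2312 character set and on some miswritten and variant characters show that the algorithm compares glyph similarity without any training, that its results are little affected by structure type or stroke division, and that accuracy reaches 100% when the input glyph and the reference glyph use the same mesh size. The second, a comparison algorithm based on combinations of stroke segments, automatically extracts simple strokes and compound strokes from the mesh description; it can compare glyphs in units of simple strokes, or adaptively in units of compound and simple strokes. Experiments on the same test set show that the stroke-based algorithms also need no training, that their results are less sensitive to changes in the overall size of the input glyph and to different deformations of slanted strokes, and that accuracy is very high, reaching 100%, for structurally regular glyphs drawn according to the constraints. Because the unit of comparison is larger, comparison is more efficient and scales to comparing and searching large sets of glyphs. The units of comparison map readily onto the character-forming units that people perceive, so this is a "white-box" method of computing glyph similarity. It applies to both whole-glyph and partial-glyph comparison, and for irregular glyphs with badly distorted proportions it can reveal the differences from the regular glyph, which suits applications oriented toward glyph analysis. In addition, a method based on a stroke relationship matrix is established for describing and computing the structural relationships of a character, which can support automatic identification of structure types.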
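     3) Given the importance of components in the study of character structure, a component description method is proposed that attaches combination-relationship annotations to the simple strokes of the stroke-segment-mesh description, together with an algorithm for automatic component discovery. Experiments show that the algorithm accurately finds the characters containing a given component, regardless of the component's position and size within the glyph.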
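     4) The character structure description system of the Chinese Character Information Dictionary (《汉字信息字典》) is improved, and a glyph similarity algorithm based on structure descriptions is proposed. Experiments show that the similar characters it finds are consistent in structure type and agree closely (above 96%) with the characters people judge to be similar. The algorithm suits similarity computation for characters whose structure type is unambiguous.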
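     5) Finally, a practical software system is designed and implemented: a tool for Chinese character glyph description and automatic comparison and analysis. It builds stroke-segment-mesh descriptions from ordinary handwritten drawing, so any conceivable character can be entered, including miswritten, variant, and composite characters, along with related information. The tool automatically converts a stroke-segment-mesh glyph into a corresponding TrueType glyph so that it can be processed like a character in a standard character set. It compares mesh glyphs automatically, in whole or in part, and returns similar characters ranked by similarity. With this tool, descriptions were completed for the 20,902 characters of the GBK character set and for characters miswritten by international students at Beijing Language and Culture University; the resulting glyph database has been applied to the analysis of writing errors in Chinese character teaching.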
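     This work contributes to the standardization of Chinese character glyph description and has application prospects in fields that rely on glyph computation, such as the input of characters outside standard character sets, the construction of digital libraries in China, Chinese-language teaching research and international promotion, research on the history of Chinese characters and culture, and the informatization of social administration.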
In the field of Chinese character information processing, existing approaches to the formal description of Chinese character glyphs are mostly based on the structural analysis method used to describe the form of characters in philological research and in the teaching of Chinese. They describe a glyph hierarchically using units of human perception, that is, glyph formation units such as structure types, components, and strokes. These methods lead to ambiguities and gaps in glyph decomposition, structure classification, and the selection of descriptive elements. They therefore cannot describe every possible glyph (including wrongly written characters, variant forms of characters in ancient literature, and combined characters), nor can they support automatic glyph comparison, and so they cannot meet the practical needs of applications based on glyph comparison and analysis, such as the description of wrongly written characters and the quantitative analysis of writing errors in teaching research, the description and comparative analysis of variant forms in ancient literature, or the retrieval of rare glyphs in electronic books.
     Recognition models based on statistical machine learning cannot handle special characters whose glyph samples cannot be collected in advance, such as wrongly written characters, variant forms in ancient literature, and combined characters: with no samples to train on, such characters cannot be classified. For ordinary characters whose samples can be collected, the statistical glyph features adopted by recognition models are difficult to interpret logically or to map to the structure types, components, and strokes of human cognition. Such models are black boxes and do not meet the demands of human-oriented glyph comparison and analysis.
     These problems stem from the lack of a unified and effective method for the formal description and automatic comparison of Chinese character glyphs. Addressing this core issue, and oriented toward applications of glyph comparison and analysis, this paper offers a new approach to describing Chinese character glyphs, together with a set of glyph comparison algorithms and some practical tools. The main innovations are:
     1) A method is proposed for formally describing Chinese characters by a stroke-segment-mesh, which uses line segments of predefined length and direction (stroke segments) as the elements of glyph description. Because these elements have suitable granularity and are standardized and unambiguous, the method can describe the glyph skeleton of any Chinese character (including wrongly written characters, variant forms of characters in ancient literature, and combined characters). Experiments show that, compared with a dot-matrix glyph with the same number of elements, a stroke-segment-mesh description needs considerably fewer effective elements and makes comparison more efficient. Moreover, the greater degree of discrimination between the stroke-segment-mesh glyphs of different characters improves the accuracy and reliability of comparison.
     2) Based on the stroke-segment-mesh formal description method, a set of glyph comparison algorithms is presented. The first compares glyphs by stroke segment and its context, using the stroke segment as the unit of comparison. Experiments on the GB2312 character set and on some wrongly written characters, variant forms, and combined characters show that the similarity results are little affected by factors such as structure type and stroke division. The algorithm needs no training and achieves a high accuracy rate when the input glyph is essentially the same size as the glyph it is compared with. The second compares glyphs by combinations of stroke segments: working from the stroke-segment-mesh, it automatically extracts simple and compound strokes and uses either simple strokes, or compound and simple strokes adaptively, as the unit of comparison. Experiments on the same character set show that the algorithms based on simple and compound strokes also compute glyph similarity without training, and that the results are less sensitive to glyph size and to different deformations of slanted strokes. These algorithms achieve a high accuracy rate (nearly 100%) for the first candidate when the input glyph has a regular structure. Because they use a larger unit of comparison, they can be applied efficiently to large-scale glyph searching. The unit of comparison maps easily to the units of human cognition, making this a "white-box" approach to glyph similarity computation. The method applies to a whole character or to part of it. It can identify how a character of non-standard structure differs from its standard form, and therefore meets the needs of glyph-analysis-oriented applications.
     A method for describing and computing structural relationships, based on a stroke relationship matrix, is also provided; it can be used for the automatic identification of the structure types of Chinese characters.
     3) Given the importance of components in research on the physical structure of Chinese characters, a component description method that attaches combination-relationship annotations to the simple strokes of a stroke-segment-mesh glyph is proposed, along with an algorithm for automatically detecting components. Experiments show that the algorithm accurately detects the characters that contain a specific component, regardless of the component's location and size in the glyph.
     4) This paper also improves the character structure description system of the "Chinese character information dictionary" and offers an algorithm for calculating the glyph similarity of Chinese characters based on structure description. Experimental results show that the lists of similar characters found by this algorithm are highly consistent in structure and conform to human cognition. The algorithm is therefore suitable for similarity calculation of Chinese characters whose structure class is unambiguous.
     5) An application software system, the Toolkit of Chinese Character Glyph Description and Automatic Comparison and Analysis, is designed and implemented. The tool creates a stroke-segment-mesh glyph description from ordinary handwriting and drawing. Any imaginable Chinese character can be entered, including wrongly written characters, variant forms of characters in ancient literature, and combined characters, together with other related information. A stroke-segment-mesh glyph can be automatically transformed into a corresponding TrueType glyph and processed just like a character in a standard character set. The tool can compare stroke-segment-mesh glyphs in whole or in part, find their similarities and differences, and produce a list of similar characters sorted by similarity. Using this tool, stroke-segment-mesh glyph descriptions have been created for the 20,902 Chinese characters of the GBK character set and for wrongly written characters produced by foreign students at Beijing Language and Culture University. The resulting glyph database has been applied to the analysis of writing errors made by foreign students.
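The stroke-segment-mesh idea in items 1) and 2) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the 4x4 mesh, the direction codes "H" and "V", the helper names `hline`, `vline`, `similarity`, and `rank`, and the use of plain set overlap (the Jaccard index) in place of the paper's context-based comparison.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

# A stroke segment is a unit-length line anchored at a mesh node:
# (row, col, direction). "H" runs one cell to the right, "V" runs one
# cell downward. This encoding is an assumption for illustration only.
Segment = Tuple[int, int, str]
Glyph = FrozenSet[Segment]


def hline(row: int, col_start: int, col_end: int) -> Set[Segment]:
    """Horizontal stroke on `row`, from node col_start to node col_end."""
    return {(row, c, "H") for c in range(col_start, col_end)}


def vline(col: int, row_start: int, row_end: int) -> Set[Segment]:
    """Vertical stroke on `col`, from node row_start to node row_end."""
    return {(r, col, "V") for r in range(row_start, row_end)}


def similarity(a: Glyph, b: Glyph) -> float:
    """Jaccard overlap of two segment sets (a stand-in, not the paper's metric)."""
    union = a | b
    if not union:
        return 1.0  # two empty glyphs are treated as identical
    return len(a & b) / len(union)


def rank(query: Glyph, library: Dict[str, Glyph]) -> List[Tuple[str, float]]:
    """Library glyphs sorted from most to least similar; no training step."""
    scored = [(name, similarity(query, glyph)) for name, glyph in library.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Three skeletons drawn on a 4x4 mesh (nodes 0..4 in each direction).
SHI = frozenset(hline(2, 0, 4) | vline(2, 0, 4))  # 十: cross
TU = frozenset(SHI | hline(4, 0, 4))              # 土: cross plus bottom bar
GAN = frozenset(SHI | hline(0, 0, 4))             # 干: cross plus top bar
LIBRARY = {"十": SHI, "土": TU, "干": GAN}
```

Because a glyph is only a set of segments, a miswritten or composite character is described exactly like a standard one, and `rank(TU, LIBRARY)` orders the library by overlap with the query without any learned model.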
     This work will benefit the standardization of Chinese character glyph description and will find wide application in fields based on Chinese character glyph computing, such as the input of Chinese characters outside the standard character sets, the construction of digital libraries in China, the research, teaching, and international promotion of Chinese, research into the history of Chinese characters and culture, and the informatization of social management.
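The stroke relationship matrix mentioned in item 2) can be pictured as a table whose entry (i, j) records how stroke i lies relative to stroke j. The sketch below is hypothetical: the abstract does not say how the matrix is built, so the bounding-box representation, the relation codes, and the `guess_structure` heuristic are all assumptions chosen to keep the example small.

```python
from typing import List, Tuple

# Assumed stroke representation: an axis-aligned bounding box
# (left, top, right, bottom) in mesh coordinates, with y growing downward.
Box = Tuple[int, int, int, int]


def relation(a: Box, b: Box) -> str:
    """Relation of stroke a to stroke b, using assumed codes:
    L = left of, R = right of, A = above, B = below, X = overlapping.
    Boxes that merely touch along an edge count as separated."""
    if a[2] <= b[0]:
        return "L"
    if a[0] >= b[2]:
        return "R"
    if a[3] <= b[1]:
        return "A"
    if a[1] >= b[3]:
        return "B"
    return "X"


def relation_matrix(strokes: List[Box]) -> List[List[str]]:
    """Square matrix of pairwise relations; the diagonal is marked '='."""
    return [
        ["=" if i == j else relation(a, b) for j, b in enumerate(strokes)]
        for i, a in enumerate(strokes)
    ]


def guess_structure(matrix: List[List[str]]) -> str:
    """Crude heuristic: if every pair of strokes is related only
    horizontally (or only vertically), report that arrangement."""
    n = len(matrix)
    codes = {matrix[i][j] for i in range(n) for j in range(n) if i != j}
    if codes and codes <= {"L", "R"}:
        return "left-right"
    if codes and codes <= {"A", "B"}:
        return "top-bottom"
    return "other"


ER = [(0, 1, 4, 1), (0, 3, 4, 3)]                  # 二: two horizontal bars
CHUAN = [(0, 0, 0, 4), (2, 0, 2, 4), (4, 0, 4, 4)]  # 川: three vertical bars
SHI = [(0, 2, 4, 2), (2, 0, 2, 4)]                 # 十: crossing strokes
```

A real structure classifier would need to group strokes into components before comparing them; this heuristic only separates glyphs whose strokes are all stacked or all side by side from everything else.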