Design and Development of an OCR-Based System for Automatic Recognition and Statistical Analysis of Survey Questionnaires
Abstract
At present, most survey questionnaires are tallied and analyzed by hand. With the rapid development of computer technology, using computers to recognize questionnaire images and to compile and analyze the results has become an inevitable trend. Specialized OCR-based software already exists for applications such as mail sorting, bank bill analysis, and ballot counting. However, because questionnaire layouts are not fixed and these specialized systems generalize poorly, automatic recognition of questionnaires remains problematic. In particular, visualization of the recognized data has not yet been studied in depth.
     This thesis takes the survey questionnaire as its subject and focuses on techniques for recognizing and tallying questionnaires, including the definition of the questionnaire layout structure, the selection of recognition regions, and visual display. A user-defined questionnaire description file, combined with information inherent in the questionnaire, drives automatic recognition and tallying, and the recognized data are then visualized. To obtain the content to be recognized, XML is used as a bridge: questionnaire information is converted from hierarchical, semi-structured XML data into relational data. A scanned image must be deskewed before it can be recognized, so special reference points are defined in the questionnaire description file and the skew is corrected by pattern matching on those points. For some questionnaire images, skew is corrected instead by using connected regions together with two properties of the text: the spacing between text lines is fixed, and the text lines are long.
     The XML mapping that produces the content needed for recognition relies mainly on the concept of related node sets; direct node mapping converts the hierarchical, semi-structured data into relational data. Handwritten content on the questionnaire is recognized from character features such as crossing features and hole features. After recognition, the multidimensional data are displayed in parallel coordinates. Questionnaire data shown in parallel coordinates contain many repeated values, so a random perturbation formula is given to spread the repeated values apart, and cluster analysis is then used to divide the data into groups. The resulting groups are displayed with the brushing technique. Based on this work, a preliminary OCR-based system for questionnaire recognition and statistical analysis has been implemented.
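As an illustration of the crossing and hole features named in the abstract, the following Python sketch computes both for a small binary glyph. It is only a sketch under stated assumptions, not the thesis's implementation: the function names, the `#`/`.` glyph encoding, row-wise crossing counts, and 4-connected background regions are all choices made here for the example.

```python
from collections import deque


def parse(rows):
    """Turn rows of '#' (ink) and '.' (background) into a 0/1 grid (assumed encoding)."""
    return [[1 if ch == "#" else 0 for ch in row] for row in rows]


def crossing_counts(glyph):
    """For each row, count background-to-ink transitions (stroke crossings)."""
    counts = []
    for row in glyph:
        previous, crossings = 0, 0
        for pixel in row:
            if pixel == 1 and previous == 0:
                crossings += 1
            previous = pixel
        counts.append(crossings)
    return counts


def hole_count(glyph):
    """Count background regions that are fully enclosed by ink."""
    height, width = len(glyph), len(glyph[0])
    # Pad with a background border so all outer background forms one region.
    padded = [[0] * (width + 2)]
    padded += [[0] + list(row) + [0] for row in glyph]
    padded += [[0] * (width + 2)]
    seen = [[False] * (width + 2) for _ in range(height + 2)]

    def flood(start_r, start_c):
        # Breadth-first fill over 4-connected background pixels.
        queue = deque([(start_r, start_c)])
        seen[start_r][start_c] = True
        while queue:
            r, c = queue.popleft()
            for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                inside = 0 <= nr < height + 2 and 0 <= nc < width + 2
                if inside and not seen[nr][nc] and padded[nr][nc] == 0:
                    seen[nr][nc] = True
                    queue.append((nr, nc))

    flood(0, 0)  # mark the outer background first
    holes = 0
    for r in range(height + 2):
        for c in range(width + 2):
            if padded[r][c] == 0 and not seen[r][c]:
                holes += 1  # any unvisited background region is a hole
                flood(r, c)
    return holes


# Hypothetical 5x5 glyphs used only to exercise the functions.
ZERO = parse([".###.", "#...#", "#...#", "#...#", ".###."])
EIGHT = parse([".###.", "#...#", ".###.", "#...#", ".###."])
ONE = parse(["..#..", "..#..", "..#..", "..#..", "..#.."])

if __name__ == "__main__":
    for name, glyph in (("0", ZERO), ("8", EIGHT), ("1", ONE)):
        print(name, hole_count(glyph), crossing_counts(glyph))
```

On these toy glyphs the hole count alone separates the three shapes (one hole for "0", two for "8", none for "1"), while the row crossing profile adds information about stroke layout. A real recognizer would combine such features with a classifier and would first normalize the size and stroke width of the handwritten characters.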
