Research on Video-Based Lip Localization and Sequence Segmentation Algorithms
Abstract
Lipreading (speechreading) means "reading out" what is said by observing the changes in the speaker's mouth shape. It is a research direction that has grown out of the joint development of artificial intelligence, image processing, pattern recognition, and related fields. Lipreading is widely used as an auxiliary means for speech recognition, and it also has broad application prospects in identity authentication for security systems, assisted sign-language recognition, language learning for the hearing impaired, and biometric recognition based on lip-movement characteristics.
     A complete lipreading system usually comprises face detection, lip detection and localization, image-sequence segmentation (endpoint detection), feature extraction, and lip-language recognition. Accurate real-time detection and localization of the lips is the first task of any lipreading system and directly affects all subsequent stages. For a given video, segmenting the image sequence of each isolated word is another key step, and it directly affects the recognition rate. At present, isolated-word segmentation for lipreading is based on audio (acoustic features) alone, which inevitably leads to incomplete segmentation of syllables. This thesis uses a sequence-segmentation algorithm that fuses visual and acoustic information and thereby improves the lipreading recognition rate. The main work includes the following:
     (1) Considering that lipreading video databases occupy large amounts of storage, which hinders sharing and distribution, and in view of the research content of this thesis, a small bimodal (audio-visual) database was built, on which all subsequent processing is carried out.
     (2) After detecting the face with the OpenCV face-detection module, and guided by extensive experiments, a lip detection and localization method based on the structural features and gray-level information of the face is proposed, and normalization of the lip image is completed. Because the normalization uses a fixed reference, the normalized image reflects the real size and shape of the lips, and the method is robust to head movement and camera zoom (a minimal sketch of such a localization step is given after this list).
     (3) Isolated-word segmentation for lipreading is currently based on audio (acoustic features); a classic approach is endpoint detection based on short-time energy. Taking this as the starting point, an improved segmentation algorithm is proposed that adds image comparison in the visual channel, achieving a fusion of the visual and acoustic modalities (also sketched below). Experimental results show that the proposed method segments isolated words more completely and, compared with audio-only segmentation, improves the lipreading recognition rate.
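The abstract does not publish implementation details, so the following is only a minimal sketch of a structure-based lip localization step of the kind described in (2). It assumes OpenCV's stock frontal-face Haar cascade; the geometric ratios (0.25, 0.65, 0.50, 0.30), the output size, and the function name locate_lip_roi are illustrative assumptions, not values from the thesis.

```python
import cv2

# Stock frontal-face Haar cascade shipped with the opencv-python package;
# any face detector could be substituted here.
FACE_CASCADE = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def locate_lip_roi(frame_bgr, out_size=(64, 48)):
    """Return a normalized gray-scale lip image, or None if no face is found."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = FACE_CASCADE.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    # Use the largest detected face box as the fixed reference.
    x, y, w, h = max(faces, key=lambda r: r[2] * r[3])
    # Structural heuristic (illustrative ratios): the mouth sits in the lower
    # part of the face box and is roughly horizontally centred.
    mx, mw = x + int(0.25 * w), int(0.50 * w)
    my, mh = y + int(0.65 * h), int(0.30 * h)
    mouth = gray[my:my + mh, mx:mx + mw]
    # Cropping relative to the face box keeps the region comparable across
    # camera zoom and moderate head movement; equalization evens the lighting.
    mouth = cv2.equalizeHist(mouth)
    return cv2.resize(mouth, out_size)
```

In a capture loop, the region returned for each frame would form the per-word image sequence that the segmentation step in (3) operates on.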
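Similarly, the audio-visual fusion described in (3) can be sketched as follows, assuming a 1-D NumPy array of audio samples aligned with the video: short-time energy proposes coarse endpoints, and inter-frame differences of the normalized lip images push those endpoints outward while the mouth is still moving. The thresholds e_thr and m_thr, the 25 ms / 10 ms analysis frames, and the helper names are placeholders rather than the thesis's actual parameters.

```python
import numpy as np

def short_time_energy(signal, frame_len=400, hop=160):
    """Short-time energy per analysis frame (25 ms window, 10 ms hop at 16 kHz)."""
    n = 1 + max(0, (len(signal) - frame_len) // hop)
    return np.array([np.sum(signal[i * hop:i * hop + frame_len] ** 2.0)
                     for i in range(n)])

def lip_motion(lip_frames):
    """Mean absolute inter-frame difference of the normalized lip images."""
    seq = np.asarray(lip_frames, dtype=np.float32)
    return np.concatenate([[0.0], np.abs(np.diff(seq, axis=0)).mean(axis=(1, 2))])

def segment_word(audio, lip_frames, sr=16000, fps=25, hop=160,
                 e_thr=0.02, m_thr=2.0):
    """Return (start, end) video-frame indices of one isolated word, or None."""
    energy = short_time_energy(audio, hop=hop)
    voiced = np.where(energy > e_thr * energy.max())[0]
    if len(voiced) == 0:
        return None
    # Audio-only endpoints, mapped from analysis-frame to video-frame indices.
    per_frame = sr / (fps * hop)          # audio analysis frames per video frame
    motion = lip_motion(lip_frames)
    end = min(int(voiced[-1] / per_frame), len(motion) - 1)
    start = min(int(voiced[0] / per_frame), end)
    # Visual refinement: extend the endpoints while the lips are still moving,
    # recovering articulation that the audio-only detector cuts off.
    while start > 0 and motion[start - 1] > m_thr:
        start -= 1
    while end < len(motion) - 1 and motion[end + 1] > m_thr:
        end += 1
    return start, end
```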
