Design and Implementation of an Audio-Visual Bimodal In-Vehicle Speech Control System

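Original English title: Design of Speech Control System on AV Bimodal Information and Its Realization
Author: Yan Lepin (严乐贫)
Degree level: Master's
Discipline: Communication and Information Systems
Keywords: bimodal speech recognition; vehicular control; HTK; ATK; simulation system
Degree year: 2010
Advisor: He Qianhua (贺前华)
Discipline code: 081001
Degree-granting institution: South China University of Technology (华南理工大学)
Thesis submission date: 2010-05-28
Defense committee chair: Hu Yongjian (胡永键)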

Abstract

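Applying voice control in the driving environment frees the driver's hands and eyes, improving both driving safety and driving enjoyment. At present, recognition rates obtained from audio information alone are very low in noisy environments, which has held back the development of in-vehicle voice control. Using visual information to assist speech recognition can raise the recognition rate of a speech recognition system under noise. Because the driver's position is fixed while driving, images are easy to capture, which makes it feasible to use visual information in an in-vehicle voice control system. Using bimodal speech recognition to counter noise in in-vehicle voice control has therefore become an important research topic. To speed up the development of in-vehicle voice control systems, this thesis builds a bimodal in-vehicle voice control simulation system on a PC platform, as a reference for the development of embedded in-vehicle voice control systems. The main work of this thesis is as follows:
     (1) The basic principles of bimodal speech recognition and the related techniques are reviewed, and a design is proposed for the bimodal in-vehicle voice control simulation system. The overall architecture uses medium-vocabulary continuous speech recognition. The audio features are Mel Frequency Cepstral Coefficients (MFCC), which reflect the characteristics of human hearing and are relatively robust to noise; the acoustic model is a Hidden Markov Model (HMM); the visual features are pixel features based on the lip contour; and auditory and visual information are combined with a late-fusion strategy for bimodal speech recognition.
     (2) Based on the practical needs of in-vehicle voice control, a bimodal database for in-vehicle control speech recognition is built. Existing bimodal databases in China and abroad are analyzed, and the criteria for building a bimodal database are summarized. The in-vehicle voice control bimodal database is then built according to these criteria. To reduce the workload of annotating the corpus, annotation software is designed and used to annotate the database.
     (3) A bimodal in-vehicle speech recognition control system is designed and implemented. The system is divided into three subsystems: model training, offline recognition, and online recognition. The subsystems are structurally connected but functionally independent. Each subsystem consists of several functional modules, and modules with the same function can be shared across subsystems. The model training subsystem trains the acoustic model and the visual model on separate auditory and visual channels, for use by the offline and online recognition subsystems. Calling the ATK (Application Toolkit for HTK) interface from the Visual C/C++ environment to process the audio signal is investigated. To make algorithm upgrades easier, the video signal processing module is implemented as a dynamic link library. So that the system can present test results intuitively, a result statistics module is designed for the offline recognition subsystem. To provide good human-computer interaction and to reduce interference from external speech effectively, the online recognition subsystem includes a spoken-dialogue interaction flow, together with normalization and optional-selection processing of the results.
     (4) The recognition performance of the simulation system is evaluated in several environments, and the evaluation results are discussed. The experimental results show that, compared with audio-only speech recognition, bimodal speech recognition is more robust to noise and is better suited to in-vehicle voice control.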
Voice control in the automotive environment can free the driver's hands and eyes and improve driving safety and pleasure. However, the poor performance of audio-only speech recognition in noisy environments restricts the development of automotive voice control. Another form of automatic speech recognition (ASR), called visual speech recognition (speech reading or lip-reading), uses a video sequence of the speaker's lips. Visual speech can improve the robustness of a recognition system in noisy environments. Audio-visual speech recognition is well suited to vehicles because the driver's position is fixed, which makes the visual features easier to capture. Audio-visual speech recognition for in-vehicle voice control has therefore become an important research topic. To expedite this research, an audio-visual speech recognition simulation system for in-vehicle voice control is built on a PC. The simulation system provides a reference for embedded in-vehicle speech control systems. The main work of this thesis is as follows:
     1) The fundamentals of audio-visual speech recognition are reviewed, and a design is proposed for the audio-visual vehicle control simulation system. Mel Frequency Cepstral Coefficients (MFCC), which approximate the response of the human auditory system and are robust to additive noise, are used as the audio-only features. A Hidden Markov Model (HMM) is used as the acoustic model. Pixel-based image features from the mouth area are used as the visual-only features. Feature fusion and decision fusion are discussed for audio-visual speech recognition.
     2) A bimodal speech recognition database for vehicular control (BiMoSp) is collected. Guidelines for building a bimodal speech database are summarized from existing audio-visual speech databases in China and abroad. All the data in BiMoSp are labeled, using a labeling tool designed to reduce the labeling workload.
     3) A bimodal speech vehicular control system (BSVCS) is designed and implemented. The BSVCS has three sub-systems: model training, offline recognition, and online recognition. The sub-systems are structurally related but functionally independent. Each sub-system is composed of several modules, and some modules are shared by two or more sub-systems. The model training sub-system includes audio training and visual training modules, and its output is used by the offline and online recognition sub-systems. The Application Toolkit for HTK (ATK) is called from Visual C/C++ to process the audio signal. The visual signal processing module is implemented as a dynamic link library so that the algorithm can be upgraded easily. The offline recognition sub-system includes a statistics module that presents test results directly and intuitively. The online recognition sub-system includes a spoken-dialogue human-computer interaction flow, together with result normalization and optional-selection processing; these are designed to provide good human-computer interaction and to reduce interference from external audio.
     4) The performance of the simulation system is tested in several environments, and the experimental results are discussed. The experiments show that, compared with traditional audio-only speech recognition, audio-visual speech recognition gives a substantial improvement and is better suited to the BSVCS.
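
The decision-level (late) fusion mentioned in item 1 can be illustrated with a small sketch. This is not the thesis's implementation: it assumes each channel's recognizer returns a log-likelihood per candidate command, and it combines them with a single stream weight. The function name late_fusion, the parameter audio_weight, and all scores below are hypothetical and chosen only for illustration.

```python
def late_fusion(audio_loglik, visual_loglik, audio_weight):
    """Weighted sum of per-command log-likelihoods from two channels.

    audio_weight is a hypothetical stream weight in [0, 1]; the visual
    channel receives 1 - audio_weight. Returns the best command and the
    fused scores.
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    fused = {}
    # Only commands scored by both channels can be fused.
    for command in audio_loglik.keys() & visual_loglik.keys():
        fused[command] = (audio_weight * audio_loglik[command]
                          + (1.0 - audio_weight) * visual_loglik[command])
    best = max(fused, key=fused.get)
    return best, fused


# Made-up scores: the audio channel slightly prefers "close window",
# while the visual channel clearly prefers "open window".
audio_scores = {"open window": -120.0, "close window": -118.0}
visual_scores = {"open window": -40.0, "close window": -55.0}

# Trusting audio (e.g. a quiet cabin) follows the audio channel.
clean_choice, _ = late_fusion(audio_scores, visual_scores, audio_weight=0.9)
# Trusting video more (e.g. heavy road noise) follows the visual channel.
noisy_choice, _ = late_fusion(audio_scores, visual_scores, audio_weight=0.3)
```

Lowering the audio weight as noise increases is one common way to let the visual channel dominate when the audio channel is unreliable; how the weight would be set in practice is left open here.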

