Research on the Application of Lip-Motion Information to Speech Recognition in Acoustically Noisy Environments
Abstract
Traditional speech recognition research uses acoustic information alone, whereas audio-visual bimodal speech recognition takes the speaker's lip-motion information together with the acoustic signal as feature parameters and performs recognition jointly, offering a new way to improve the robustness and noise immunity of speech recognition systems. This thesis focuses on practical issues in audio-visual speech recognition, including video front-end processing, visual feature extraction, and audio-visual information fusion. The main work is as follows:
     1) A Chinese sentence-level bimodal speech database (BiModal Speech Database, BiMoSp) for in-vehicle control systems was built from recordings of 26 speakers (14 male, 12 female). A questionnaire survey of drivers yielded the 68 most frequently used vehicle-control commands as the corpus, and each speaker provided 4 audio-visual samples of every command.
     2) A lip-region localization algorithm based on multiple color spaces is proposed. The algorithm combines color edge detection in RGB space with the hue and saturation components of HSV space, adjusts the baseline of the lip region according to the positional characteristics of the mouth, determines the lip boundary points by projection, and finally localizes the lip region in a binary image. To assess the robustness of the video processing, images from other databases were also included in the experiments; the localization accuracy was 98.25%, an improvement of 3.37% over a PCA-based localization algorithm.
     3) To improve the accuracy and speed of contour extraction, an improved geometric active contour (GAC) model is proposed that exploits multi-directional gradient information and prior knowledge. The multi-directional gradients and the prior elliptical shape of the mouth (Prior Shape) are introduced into the level-set energy function, avoiding the shortcomings of the traditional GAC model in lip-contour extraction. Compared with the traditional GAC, the model improves lip-contour extraction accuracy by 8.38%.
     4) A dynamic feature extraction method based on inter-frame distance and linear discriminant analysis (LDA) is proposed, remedying the shortcomings of difference (delta) features. The resulting features not only embed prior knowledge of the speech classes but also capture the texture variation of the visual features. Experiments show that applying the inter-frame distance to static features derived from the DTCWT reduces the recognition error rate by a relative 3.25%, while applying LDA to the same static features reduces it by a relative 6.50%. Combining the LDA-transformed features with first- and second-order delta features reduces the error rate, relative to the static features, by 9.44% and 15.43% respectively. The final dynamic features, combining inter-frame distance and LDA deltas, reduce the error rate by 20.12% relative to the static features.
     5) A dual training model is proposed to improve the recognition performance of audio-visual feature fusion. Considering the noise effects caused by the mismatch between training and test data, and without degrading recognition speed, a noise-trained model and a baseline model jointly perform feature-fusion speech recognition. Experiments in noisy conditions on an English audio-visual database (AMP-AVSp) and the Chinese bimodal database (BiMoSp) show that the dual training model greatly improves performance at high noise levels: at SNR = -5 dB, the error rates on AMP-AVSp and BiMoSp are reduced by 45.27% and 37.24% respectively compared with the baseline model alone.
     6) A decision-fusion method is proposed that selects optimal stream exponents via integer linear programming (ILP). Exploiting the linear combination of log-likelihoods in decision fusion, a stream-exponent selection model is built using the proposed maximum log-likelihood distance (MLLD) criterion. In the experiments, stream exponents chosen by an exhaustive search with step size 0.05 serve as the reference. The stream weights and recognition results obtained by the two methods are very close; since exhaustive search generally finds the optimal solution, this agreement indicates that the ILP model can select optimal stream exponents for audio-visual decision fusion and thus achieve the best recognition performance.
Audio-visual speech recognition (AVSR), also known as bimodal speech recognition, has become a promising way to significantly improve the robustness of automatic speech recognition (ASR). Motivated by the bimodal nature of human speech perception, work in this field aims to improve ASR by exploiting the visual modality of the speaker's mouth region in addition to the traditional audio modality. This thesis addresses several key issues in AVSR, namely lip contour extraction, visual feature extraction, and audio-visual fusion. The main contributions are as follows:
     1) An audio-visual bimodal continuous speech database for vehicular voice control was collected. It comprises 26 speakers (14 male, 12 female), each uttering all 68 continuous sentences 4 times; the sentences were derived from a driver questionnaire survey.
     2) An adaptive mouth-region detection algorithm based on multiple color spaces is presented. The algorithm combines color edge detection in RGB space with hue and saturation thresholding in HSV space. According to the position of the mouth within the face, an adaptive localization method detects the mouth baseline automatically, and the rectangular mouth region is then found by projection. Experiments show that the proposed algorithm locates the mouth region quickly, accurately, and robustly, with a correct-detection rate of 98.25%, a 3.37% improvement over a principal component analysis (PCA) based method.
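The multi-color-space localization step can be sketched roughly as follows. The specific thresholds, the gradient-based edge measure, and the function name are illustrative assumptions for this sketch, not the thesis's actual implementation.

```python
import numpy as np

def locate_mouth_region(rgb, sat_min=0.3, edge_min=0.1):
    """Sketch of multi-color-space lip localization: combine an RGB
    color-edge map with HSV hue/saturation thresholds (lips are
    reddish), then project the binary mask onto the image axes to
    get a bounding box (row0, row1, col0, col1)."""
    img = rgb.astype(float) / 255.0
    r, g, b = img[..., 0], img[..., 1], img[..., 2]

    # --- HSV components (standard conversion formulas) ---
    v = img.max(-1)
    c = v - img.min(-1)
    s = np.where(v > 0, c / np.maximum(v, 1e-9), 0.0)
    h = np.zeros_like(v)
    m = c > 1e-9
    hr = m & (v == r)
    hg = m & (v == g) & ~hr
    hb = m & ~hr & ~hg
    h[hr] = ((g - b)[hr] / c[hr]) % 6
    h[hg] = (b - r)[hg] / c[hg] + 2
    h[hb] = (r - g)[hb] / c[hb] + 4
    h /= 6.0

    # --- color edges in RGB space: per-channel gradient magnitude ---
    gy, gx = np.gradient(img, axis=(0, 1))
    edges = np.sqrt((gx ** 2 + gy ** 2).sum(-1)) > edge_min

    # --- lip-colored pixels: reddish hue with enough saturation ---
    lip = ((h >= 0.9) | (h <= 0.05)) & (s >= sat_min)

    mask = lip & edges  # lip-colored edge pixels form the region outline
    rows, cols = np.where(mask)
    if rows.size == 0:
        return None
    return int(rows.min()), int(rows.max()), int(cols.min()), int(cols.max())
```

On a synthetic frame with a reddish patch on a gray background, the projection of the combined mask recovers the patch's bounding box.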
     3) To increase the accuracy and speed of lip-contour extraction, an improved geometric active contour (GAC) model based on a prior shape (PS) and multi-directional gradient information is proposed. The multi-directional gradients and the prior lip shape are introduced into the level-set energy function, avoiding the shortcomings of the traditional GAC model in lip-contour extraction. Experiments show that the PS level-set model improves lip-contour detection accuracy by 8.38% over the traditional GAC model.
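One plausible form of such an energy functional, written from the description above (the exact terms, the Heaviside/Dirac regularization, and the weights α, β are guesses for illustration, not the thesis's formulation):

```latex
E(\phi) =
\underbrace{\int_\Omega g\bigl(|\nabla I|\bigr)\,\delta(\phi)\,|\nabla\phi|\,dx}_{\text{edge-based GAC length term}}
\;+\; \alpha \underbrace{\int_\Omega g\bigl(|\nabla I|\bigr)\,H(-\phi)\,dx}_{\text{area term}}
\;+\; \beta \underbrace{\int_\Omega \bigl(\phi - \phi_0\bigr)^2\,dx}_{\text{prior-shape term}}
```

Here φ is the level-set function, g is an edge-indicator function built from the multi-directional gradients of the image I, H and δ are the Heaviside and Dirac functions, and φ₀ is the signed distance function of the prior elliptical lip shape; the last term penalizes contours that drift away from the expected mouth geometry.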
     4) A dynamic visual feature extraction method based on inter-frame distance and LDA is proposed. The resulting features capture important lip-motion information and also embody a priori speech-classification information. Evaluation experiments show that, for DTCWT-based static features, the inter-frame distance yields a relative improvement of 3.25% and LDA a relative improvement of 6.50%. With further delta and delta-delta augmentation, recognition improves by 9.44% and 15.43% respectively, and the final dynamic features yield a 20.12% improvement over the static features.
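A minimal sketch of the two building blocks, assuming the inter-frame distance is the Euclidean distance between consecutive static feature vectors and using plain Fisher LDA as a stand-in for the thesis's projection; the names and details are illustrative.

```python
import numpy as np

def frame_distance(static):
    """Inter-frame distance: Euclidean distance between consecutive
    static feature vectors (the first frame is paired with itself).
    `static` has shape (T, D); returns a (T,) distance sequence."""
    padded = np.vstack([static[:1], static])
    return np.linalg.norm(np.diff(padded, axis=0), axis=1)

def lda_projection(X, y, n_components):
    """Plain Fisher LDA: project features onto the leading
    eigenvectors of Sw^{-1} Sb, embedding class (speech-unit)
    labels into the visual features."""
    mean = X.mean(0)
    d = X.shape[1]
    Sw = np.zeros((d, d))  # within-class scatter
    Sb = np.zeros((d, d))  # between-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(0)
        Sw += (Xc - mc).T @ (Xc - mc)
        diff = (mc - mean)[:, None]
        Sb += len(Xc) * (diff @ diff.T)
    # Regularized generalized eigenproblem via Sw^{-1} Sb
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw + 1e-6 * np.eye(d), Sb))
    order = np.argsort(evals.real)[::-1][:n_components]
    return X @ evecs[:, order].real
```

The projected features separate classes along the most discriminative direction, which is what lets the dynamic features "embed prior knowledge of the speech classes".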
     5) A dual training model is proposed to improve the recognition rate of audio-visual feature fusion. Considering the noise introduced by the mismatch between training and test data, as well as recognition speed, a noise-trained model and a baseline model jointly perform feature-fusion speech recognition. Experiments on two audio-visual databases, the English AMP-AVSp and the Mandarin BiMoSp, show that the dual training model improves recognition accuracy on both; for example, at SNR = -5 dB in the test data, the improvements are 45.27% and 37.24% respectively.
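The pairing logic could look like the following: score the fused features under both model sets and keep the word whose best-scoring model wins, so a noisy utterance naturally falls back to the noise-trained set. The diagonal-Gaussian scorer is my own toy stand-in for the thesis's HMM setup.

```python
import numpy as np

def log_gauss(x, mean, var):
    """Diagonal-Gaussian log-likelihood, a toy stand-in for an HMM score."""
    return float(np.sum(-0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)))

def dual_model_decode(features, clean_models, noisy_models):
    """Score the fused audio-visual feature frames against both the
    baseline (clean-trained) and noise-trained model sets, and return
    the word whose best-scoring model wins.
    Each model set maps word -> (mean, var)."""
    best_word, best_score = None, -np.inf
    for models in (clean_models, noisy_models):
        for word, (mean, var) in models.items():
            score = sum(log_gauss(f, mean, var) for f in features)
            if score > best_score:
                best_word, best_score = word, score
    return best_word
```

Because both model sets compete in a single maximum, no extra pass over the data is needed, which matches the abstract's claim that recognition speed is preserved.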
     6) A new weight-estimation method based on integer linear programming (ILP) is developed to find the optimal exponent weights for combining the audio (speech) and visual (mouth) streams in audio-visual decision fusion. The ILP model is built from the linear combination of the two streams' log-likelihoods and the maximum log-likelihood distance (MLLD) criterion. In the experiments, exhaustive search (ES) and frame dispersion (FD) of hypotheses serve as reference methods; the ILP results are close to those of ES and superior to those of FD. Since ES finds the optimal result, this indicates that ILP can also obtain the optimal stream weights for audio-visual decision-fusion speech recognition.
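Decision fusion combines per-stream scores as λ·log P(audio) + (1−λ)·log P(video); the 0.05-step exhaustive search used as the reference method can be sketched as follows. The ILP formulation itself is not reproduced here, and the data in the usage example are illustrative.

```python
import numpy as np

def fuse(log_pa, log_pv, lam):
    """Exponent-weighted decision fusion: per-class combined score
    lam * log P(audio) + (1 - lam) * log P(video)."""
    return lam * log_pa + (1 - lam) * log_pv

def exhaustive_stream_weight(log_pa, log_pv, labels, step=0.05):
    """Reference method: try lambda in {0, step, ..., 1} and keep the
    value maximizing classification accuracy on held-out utterances.
    log_pa / log_pv are (N, C) per-class log-likelihood matrices."""
    best_lam, best_acc = 0.0, -1.0
    for lam in np.arange(0.0, 1.0 + 1e-9, step):
        pred = fuse(log_pa, log_pv, lam).argmax(1)
        acc = float((pred == labels).mean())
        if acc > best_acc:
            best_lam, best_acc = round(float(lam), 2), acc
    return best_lam, best_acc
```

When the audio stream is reliable and the video stream misleading, the search pushes λ above 0.5, i.e. it weights the audio stream more heavily, which is the behavior the ILP model is shown to reproduce without the exhaustive sweep.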
