Research on Automatic Speech Recognition Based on Feature Compensation
Abstract
This thesis studies the front-end noise robustness problem in automatic speech recognition. As is well known, the fundamental goal of speech recognition is to enable machines to understand human language. Under current laboratory conditions many recognition systems already achieve very good performance, but in real environments, because of complex, changing noise and the interference of unknown factors, performance often degrades sharply and falls far short of practical requirements. Noise robustness has therefore always been a very important aspect of speech recognition research. The root of the robustness problem is the mismatch between the training and testing environments. In practice this mismatch is caused by the speech acquisition environment (e.g., additive noise and channel distortion) and by the speaker (e.g., speaking style and accent); all of these influences can be regarded as noise. To keep a speech recognition system performing well in different noise environments, various methods are needed to enhance its robustness.
     There are many approaches to noise robustness, but they can generally be divided into two categories: front-end methods and back-end methods. Front-end methods process the speech signal itself or the speech features in order to remove or suppress the influence of noise as much as possible; back-end methods improve the tolerance and adaptability of the acoustic models, either making the models tolerate a certain level of noise or adjusting the model parameters to track changes in the noise environment. This thesis concentrates on front-end methods: it improves several existing methods and proposes some new ones.
     Chapter 1 briefly reviews the development of speech recognition technology and introduces the main components of an automatic speech recognition system built on the statistical modeling framework. Because real-world noise is highly diverse, many noise-robustness methods have emerged, each with its own characteristics and range of applicability. Chapter 2 therefore gives a fairly comprehensive introduction to and summary of the noise robustness problem from four perspectives: robust feature extraction, speech enhancement, feature compensation/enhancement, and model compensation.
     Chapter 3 first introduces the offline feature compensation algorithm based on a first-order vector Taylor series (VTS) expansion of an explicit model of environmental distortion. The offline algorithm is not ideal in practice: its biggest drawback is that its heavy computational load greatly reduces processing efficiency. Building on the offline algorithm, we therefore propose a practical first-order VTS feature compensation algorithm that preserves the performance of the offline version while greatly improving real-time processing.
     Although the practical first-order VTS feature compensation algorithm performs well, it models the noise with a single Gaussian, as the offline algorithm does. In real environments noise is complex and varied, and a single Gaussian may not describe the distribution of the noise parameters well; the clean speech estimate then becomes inaccurate, which ultimately harms recognition performance. To address this, Chapter 4 proposes a first-order VTS feature compensation algorithm that models the noise with a Gaussian mixture. Experimental results show that multi-Gaussian noise modeling can improve recognition performance to a certain extent.
This thesis focuses on the noise-robust front-end of automatic speech recognition (ASR). As is well known, the ultimate purpose of speech recognition is to enable computers to understand spontaneous human speech. Many mature systems now achieve fairly high recognition accuracy in the laboratory. In real environments, however, performance degrades so severely under various noises and unknown disturbances that the systems fall far short of practical use. Noise robustness is therefore a very important part of speech recognition research. The robustness problem can be traced to the mismatch between the training and testing environments. In the real world this mismatch is caused by the speech-collection environment (additive noise, convolutional noise, etc.) and by the speaker (speaking style, accent, etc.); all of these influences can be regarded as noise. To keep a speech recognition system performing well under such noise conditions, various methods must be used to enhance its robustness.
     Noise-robust methods are diverse but can be roughly classified into two categories: front-end methods and back-end methods. Front-end methods mitigate the effect of noise by processing the speech signal or the speech features, while back-end methods adjust the models to follow changes in the environment so that the models and the real environment match. This thesis focuses primarily on front-end noise-robust methods: several existing algorithms are implemented and improved, and several new methods are proposed.
     Chapter 1 gives an overview of the development history of ASR and highlights the main components of an ASR system based on statistical modeling.
     Because of the diversity of noises there are many kinds of noise-robust front-end methods, each with its own characteristics and range of applicability. Chapter 2 therefore introduces and summarizes them from four aspects: robust feature extraction, speech enhancement, feature compensation/enhancement, and model compensation.
     Chapter 3 first introduces offline feature compensation based on a first-order vector Taylor series (VTS) approximation of an explicit model of environmental distortion. The offline algorithm is not ideal in practice: its biggest disadvantage is its heavy computation, which reduces processing efficiency. A practical first-order VTS algorithm is therefore proposed; it keeps performance comparable to the offline version while greatly increasing efficiency.
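     To make the approximation concrete, the following is the commonly used cepstral-domain model of environmental distortion and its first-order VTS expansion around an expansion point (μ_x, μ_n, μ_h); the notation (clean speech x, additive noise n, channel h, DCT matrix C) follows the standard VTS literature and is offered as a sketch rather than the thesis's exact formulation.

\begin{align}
  y   &= x + h + C\,\log\!\bigl(1 + \exp\bigl(C^{-1}(n - x - h)\bigr)\bigr),\\
  y   &\approx y_0 + G\,(x - \mu_x) + G\,(h - \mu_h) + (I - G)\,(n - \mu_n),\\
  y_0 &= \mu_x + \mu_h + C\,\log\!\bigl(1 + \exp\bigl(C^{-1}(\mu_n - \mu_x - \mu_h)\bigr)\bigr),\\
  G   &= \left.\frac{\partial y}{\partial x}\right|_{(\mu_x,\mu_n,\mu_h)}
       = C\,\operatorname{diag}\!\left(\frac{1}{1 + \exp\bigl(C^{-1}(\mu_n - \mu_x - \mu_h)\bigr)}\right)C^{-1},\\
  \mu_y &\approx y_0, \qquad
  \Sigma_y \approx G\,\Sigma_x G^{\top} + (I - G)\,\Sigma_n (I - G)^{\top}.
\end{align}

Under this linearization the noisy observation is approximately Gaussian given a clean-speech mixture component, which is what makes MMSE feature compensation tractable.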
     Although the practical first-order VTS algorithm achieves good performance, it assumes, as the offline algorithm does, that for each utterance the cepstral-domain noise feature vector follows a single Gaussian probability density function (PDF). Because of the diversity and complexity of real noise, a single Gaussian may not describe the noise distribution well, so the clean speech estimate becomes inaccurate and recognition performance ultimately suffers. Chapter 4 therefore proposes a first-order VTS algorithm in which the cepstral-domain noise feature vector follows a multi-Gaussian PDF. The results show that this method can improve system performance to some extent.
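     The multi-Gaussian extension can be pictured as a posterior-weighted MMSE estimate taken over pairs of clean-speech and noise Gaussian components. The NumPy sketch below only illustrates that structure under simplifying assumptions (diagonal covariances, no channel term, a zeroth-order bias-removal estimate per component pair); the function vts_mmse_compensate and its interface are hypothetical and are not the thesis's actual implementation.

import numpy as np

def vts_mmse_compensate(y, clean_gmm, noise_gmm, C, C_inv):
    """Illustrative MMSE feature compensation with a GMM over the cepstral noise vector.

    y          : observed noisy cepstral vector, shape (D,)
    clean_gmm  : list of (weight, mean, diag_var) tuples for the clean-speech GMM
    noise_gmm  : list of (weight, mean, diag_var) tuples for the noise GMM
    C, C_inv   : DCT matrix and its (pseudo-)inverse, shape (D, D)
    The channel term is omitted to keep the sketch short.
    """
    D = len(y)
    log_scores, estimates = [], []
    for w_x, mu_x, var_x in clean_gmm:
        for w_n, mu_n, var_n in noise_gmm:
            u = C_inv @ (mu_n - mu_x)
            G = C @ np.diag(1.0 / (1.0 + np.exp(u))) @ C_inv      # dy/dx at the expansion point
            mu_y = mu_x + C @ np.log1p(np.exp(u))                  # predicted noisy-speech mean
            cov_y = (G @ np.diag(var_x) @ G.T
                     + (np.eye(D) - G) @ np.diag(var_n) @ (np.eye(D) - G).T)
            var_y = np.diag(cov_y)                                 # diagonal approximation
            # Log-likelihood of y under this (clean, noise) component pair
            ll = -0.5 * np.sum(np.log(2.0 * np.pi * var_y) + (y - mu_y) ** 2 / var_y)
            log_scores.append(np.log(w_x * w_n) + ll)
            # Zeroth-order estimate: remove the predicted noise bias from y
            estimates.append(y - (mu_y - mu_x))
    post = np.exp(np.array(log_scores) - np.max(log_scores))      # component-pair posteriors
    post /= post.sum()
    return sum(p * x_hat for p, x_hat in zip(post, estimates))    # posterior-weighted clean estimate

A single-Gaussian noise model corresponds to a noise_gmm with one component; adding components lets the posterior select, utterance by utterance, the noise mode that best explains the observation.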