Research on Adaptation Techniques in Continuous-Speech Keyword Spotting Systems (连续语音关键词识别系统中自适应技术的研究)
Abstract
Automatic speech recognition (ASR) is increasingly widely used in everyday life. It can be broadly divided into continuous speech recognition and keyword spotting. Compared with continuous speech recognition, keyword spotting is better suited to making spoken dialogue with a system feel natural: it understands the user's meaning by catching the keywords that carry the important information in an utterance, rather than requiring every word in the sentence to be recognized correctly. This also makes it a good way to handle the irregular and disfluent speech that occurs in natural conversation.
     In ASR, the recognition rate drops sharply when the test speech differs substantially from the training speech. Adaptation techniques use a small amount of speech from the test speaker to adjust the system parameters, narrowing the mismatch between the system's models and the test speaker and thereby improving the recognition rate.
     This thesis studies the application of speaker adaptation and speaker normalization techniques in a keyword spotting system. The main contributions are:
     1. A speaker-independent keyword spotting baseline system is built on the continuous hidden Markov model (CHMM) framework. The thesis discusses the components involved in building it, including speech preprocessing, feature extraction, acoustic model construction and training, keyword detection, and keyword verification. The baseline system is then evaluated, and the need to add an adaptation module to it is pointed out.
     2. Speaker adaptation and speaker normalization techniques are studied, and combining the two is proposed. Experiments show that applying speaker normalization during training makes the resulting model more speaker-independent, and that adaptation starting from such a model reaches a higher recognition rate. Several combinations of speaker normalization and adaptation methods are compared and validated experimentally, and the combination of speaker adaptive training (SAT), a speaker normalization method, with constrained maximum likelihood linear regression (CMLLR) is selected.
     3. Building on the keyword spotting baseline, an interactive voice query system for stock information is implemented. A speaker adaptation module is added to the system and two adaptation schemes are implemented. Finally, the system is evaluated, confirming the effectiveness of the adaptation and speaker normalization techniques discussed in the thesis.
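As a rough illustration of the idea behind CMLLR-style adaptation, the sketch below estimates an affine feature transform from a few frames of a new speaker and checks that it raises the likelihood under a fixed speaker-independent model. It is a toy under strong assumptions and not the thesis's implementation: the "model" is a single diagonal-covariance Gaussian rather than a CHMM, the transform is restricted to one scale and one bias per feature dimension, and the function names (`estimate_diag_cmllr`, `apply_transform`, `avg_log_likelihood`) and the data are hypothetical. In this restricted case the maximum-likelihood transform has a closed form: it matches the speaker's sample mean and variance to the model's.

```python
import math


def estimate_diag_cmllr(frames, model_mean, model_var):
    """Closed-form ML estimate of x' = a*x + b per dimension.

    Toy case only: one diagonal-covariance Gaussian, diagonal transform.
    Assumes every dimension of the adaptation data has non-zero variance.
    """
    n = len(frames)
    scale, bias = [], []
    for d in range(len(model_mean)):
        column = [f[d] for f in frames]
        m = sum(column) / n
        s = sum((x - m) ** 2 for x in column) / n  # biased sample variance
        a = math.sqrt(model_var[d] / s)            # match the model variance
        scale.append(a)
        bias.append(model_mean[d] - a * m)         # match the model mean
    return scale, bias


def apply_transform(frame, scale, bias):
    """Apply the per-dimension affine transform to one feature vector."""
    return [a * x + b for x, a, b in zip(frame, scale, bias)]


def avg_log_likelihood(frames, model_mean, model_var, scale=None, bias=None):
    """Average per-frame log-likelihood of (optionally transformed) features.

    Includes the log|a| Jacobian term that a feature-space transform
    contributes to the likelihood.
    """
    dim = len(model_mean)
    if scale is None:
        scale = [1.0] * dim
    if bias is None:
        bias = [0.0] * dim
    total = 0.0
    for f in frames:
        y = apply_transform(f, scale, bias)
        for d in range(dim):
            total += math.log(abs(scale[d]))
            total -= 0.5 * math.log(2.0 * math.pi * model_var[d])
            total -= 0.5 * (y[d] - model_mean[d]) ** 2 / model_var[d]
    return total / len(frames)


# Hypothetical data: a unit Gaussian model and four frames from a
# mismatched speaker.
model_mean, model_var = [0.0, 0.0], [1.0, 1.0]
speaker_frames = [[2.0, -1.0], [3.0, -3.0], [4.0, -2.0], [3.0, -2.0]]

scale, bias = estimate_diag_cmllr(speaker_frames, model_mean, model_var)
before = avg_log_likelihood(speaker_frames, model_mean, model_var)
after = avg_log_likelihood(speaker_frames, model_mean, model_var, scale, bias)
print("average log-likelihood before:", before)
print("average log-likelihood after: ", after)
```

In a real system the transform would typically be a full matrix estimated iteratively from HMM state occupancies, and SAT would re-estimate the model parameters with such per-speaker transforms applied during training; neither step is shown here.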
