Abstract

In recent years, speaker adaptation for speech recognition has been widely used in practical engineering. Using the auxiliary input feature i-vector is one of the most effective approaches to speaker adaptation. However, extracting an i-vector requires the data of an entire utterance, so it cannot be applied to online adaptation. Therefore, this paper proposes a new adaptive framework based on an i-vector clustering dictionary and an attention mechanism, which realizes online adaptation at test time without extracting i-vectors and without a second decoding pass. The framework is flexible and extensible, making it easy to apply to other types of adaptation, such as geographical and gender adaptation. Experimental results on the Switchboard speech recognition task show that the proposed framework outperforms the baseline across different acoustic models. In addition, the rationality of the proposed framework is further demonstrated on speaker recognition tasks.
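The core idea described above, attending over a fixed dictionary of i-vector cluster centroids using only the current acoustic input, can be illustrated with a minimal sketch. Everything here (the projection matrix `W`, dot-product scoring, dimensions) is an assumption for illustration, not the paper's exact architecture: the dictionary is built offline by clustering training-speaker i-vectors, and at test time attention weights over the centroids produce a pseudo i-vector without per-utterance extraction or a second decoding pass.

```python
import numpy as np

rng = np.random.default_rng(0)

K, D, F = 8, 100, 40                        # clusters, i-vector dim, acoustic feature dim
dictionary = rng.standard_normal((K, D))    # hypothetical i-vector cluster centroids (fixed after training)
W = rng.standard_normal((F, D)) * 0.01      # assumed learned projection into i-vector space

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def adapt_vector(frame):
    """Attend over the centroid dictionary with a single acoustic frame."""
    query = frame @ W                       # project the frame into i-vector space
    scores = dictionary @ query             # dot-product attention scores, shape (K,)
    weights = softmax(scores)               # attention distribution over clusters
    return weights @ dictionary             # weighted centroid sum: a pseudo i-vector

frame = rng.standard_normal(F)              # one frame of acoustic features
v = adapt_vector(frame)                     # shape (D,), usable as an auxiliary input
```

The resulting vector can be fed to the acoustic model in place of a true i-vector, which is what makes the scheme usable online: no utterance-level statistics are needed before decoding starts.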