End-to-end Mandarin speech recognition based on improved hybrid CTC/attention architecture
  • Authors: YANG Hong-wu (杨鸿武); ZHOU Gang (周刚)
  • Affiliation: College of Physics and Electronic Engineering, Northwest Normal University
  • Keywords: speech recognition; connectionist temporal classification; attention mechanism; hybrid CTC/attention; end-to-end system
  • Journal: Journal of Northwest Normal University (Natural Science) (journal code: XBSF)
  • Publication date: 2019-05-15
  • Volume/Issue: 2019, Vol. 55, Issue 03 (cumulative No. 206)
  • Funding: National Natural Science Foundation of China (11664036, 61263036); Science and Technology Innovation Team Project of Gansu Provincial Higher Education Institutions (2017C-03)
  • Language: Chinese
  • Record ID: XBSF201903009
  • Pages: 52-57 (6 pages)
  • CN: 62-1087/N
Abstract
End-to-end automatic speech recognition replaces the complex modules of conventional systems with a single deep network architecture, which reduces the difficulty of building a speech recognition system. This paper improves the traditional hybrid connectionist temporal classification (CTC)/attention-based end-to-end architecture: a dynamically adjusted parameter is introduced to linearly interpolate between the CTC model and the attention-based model, yielding a hybrid end-to-end recognizer. The improved method is applied to Mandarin speech recognition, using a bidirectional long short-term memory network with projection layers (BLSTMP) as the encoder and an 83-dimensional acoustic feature vector consisting of 80 mel-scale filter-bank coefficients plus pitch features. Experimental results show that, compared with traditional end-to-end speech recognition methods, the proposed method reduces the word error rate on Mandarin speech recognition by 3.8%.
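
Below is a minimal PyTorch sketch (not the authors' code) of the training objective described above, i.e. a linear interpolation between the CTC loss and the attention loss weighted by a dynamically adjusted parameter; the lambda schedule, tensor shapes, and padding conventions are illustrative assumptions, since the abstract does not specify them.

    import torch.nn.functional as F

    def dynamic_lambda(epoch, num_epochs, lam_start=0.5, lam_end=0.1):
        # Assumed linear schedule: lean on CTC early in training (it enforces
        # monotonic alignments), then shift the weight toward the attention decoder.
        frac = min(epoch / max(num_epochs - 1, 1), 1.0)
        return lam_start + frac * (lam_end - lam_start)

    def hybrid_ctc_attention_loss(ctc_log_probs,   # (T, N, C) log-softmax encoder outputs
                                  input_lengths,   # (N,) encoder output lengths
                                  ctc_targets,     # (N, S) label ids (0 reserved for blank)
                                  target_lengths,  # (N,) label sequence lengths
                                  att_logits,      # (N, S, C) attention decoder logits
                                  att_targets,     # (N, S) label ids, -1 marks padding
                                  lam):
        # L = lam * L_CTC + (1 - lam) * L_attention
        ctc = F.ctc_loss(ctc_log_probs, ctc_targets, input_lengths,
                         target_lengths, blank=0, zero_infinity=True)
        att = F.cross_entropy(att_logits.transpose(1, 2), att_targets,
                              ignore_index=-1)
        return lam * ctc + (1.0 - lam) * att

In training, lam = dynamic_lambda(epoch, num_epochs) would be recomputed each epoch and passed to the loss, in contrast to the fixed interpolation weight of the baseline hybrid CTC/attention architecture.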
