Abstract
Long short-term memory recurrent neural networks (LSTM-RNNs) have shown great success in automatic speech recognition (ASR) and have become the state-of-the-art acoustic model for time-sequence modeling tasks. However, it remains difficult to train deep LSTM-RNNs while keeping the number of parameters small. We use highway connections between memory cells in adjacent layers to train small-footprint highway LSTM-RNNs (HLSTM-RNNs), which are deeper and thinner than conventional LSTM-RNNs. Experiments on Switchboard (SWBD) indicate that we can train thinner and deeper HLSTM-RNNs that have fewer parameters than a conventional 3-layer LSTM-RNN and achieve a lower word error rate (WER). Compared with their small-footprint LSTM-RNN counterparts, the small-footprint HLSTM-RNNs show a greater reduction in WER.
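To make the idea of a highway connection between memory cells in adjacent layers concrete, the following is a minimal sketch of one cell-state update. It assumes the commonly used highway LSTM formulation of Zhang et al. (2016), in which a carry gate `d` lets the lower layer's cell state flow directly into the upper layer's cell state. The abstract does not spell out the equations, so the function name, argument names, and the simplification of passing already-computed gate values are all illustrative assumptions rather than the authors' implementation.

```python
import math


def sigmoid(z):
    """Logistic function used for the carry gate."""
    return 1.0 / (1.0 + math.exp(-z))


def highway_cell_update(c_lower, c_prev, candidate, i_gate, f_gate, d_pre):
    """Element-wise cell-state update for one upper-layer step (sketch).

    Assumed formulation (not taken from the abstract):
        c_t^{l+1} = d * c_t^{l} + f * c_{t-1}^{l+1} + i * g

    c_lower   : cell state of the layer below at the current time step
    c_prev    : this layer's cell state at the previous time step
    candidate : candidate cell input g (already passed through tanh)
    i_gate    : input-gate activations in [0, 1]
    f_gate    : forget-gate activations in [0, 1]
    d_pre     : pre-activation of the carry gate (hypothetical input)
    """
    new_cell = []
    for cl, cp, g, i, f, dp in zip(c_lower, c_prev, candidate,
                                   i_gate, f_gate, d_pre):
        d = sigmoid(dp)  # carry gate: how much lower-layer state passes up
        new_cell.append(d * cl + f * cp + i * g)
    return new_cell


# With a carry-gate pre-activation of 0 the gate is exactly 0.5.
half_open = highway_cell_update([2.0], [1.0], [0.5], [0.5], [0.5], [0.0])

# With a strongly negative pre-activation the gate is almost closed, and
# the update approaches the conventional LSTM form f * c_prev + i * g.
nearly_closed = highway_cell_update([2.0], [1.0], [0.5], [0.5], [0.5], [-50.0])
```

In this sketch the extra cost of the highway path is only the carry gate, which is one reason such connections are attractive when the goal is a deeper but thinner network with a small parameter count.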