Using Highway Connections to Enable Deep Small-footprint LSTM-RNNs for Speech Recognition

Authors: CHENG Gaofeng; LI Xin; YAN Yonghong
Affiliations: Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics; University of Chinese Academy of Sciences
Keywords: Long short-term memory; Highway connections; Small-footprint; Speech recognition
Chinese journal name: EDZX
English journal name: Chinese Journal of Electronics (电子学报, English edition)
Publication date: 2019-01-15
Publisher: Chinese Journal of Electronics
Year: 2019
Issue: v.28
Funding: Supported by the National Key Research and Development Program (No. 2016YFB0801203, No. 2016YFB0801200) and the National Natural Science Foundation of China (No. 11590774, No. 11590770)
Language: English
Pages: EDZX201901015
Page count: 6
CN: 01
ISSN: 10-1284/TN
Classification number: 111-116

Abstract

Long short-term memory RNNs (LSTM-RNNs) have shown great success in the Automatic speech recognition (ASR) field and have become the state-of-the-art acoustic model for time-sequence modeling tasks. However, it is still difficult to train deep LSTM-RNNs while keeping the number of parameters small. We use highway connections between memory cells in adjacent layers to train small-footprint highway LSTM-RNNs (HLSTM-RNNs), which are deeper and thinner than conventional LSTM-RNNs. Experiments on Switchboard (SWBD) indicate that we can train thinner and deeper HLSTM-RNNs that have fewer parameters than the conventional 3-layer LSTM-RNNs and achieve a lower Word error rate (WER). Compared with their small-footprint LSTM-RNN counterparts, the small-footprint HLSTM-RNNs show a greater reduction in WER.
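
The abstract names the mechanism (highway connections between memory cells in adjacent layers) without giving its equations. The following is a minimal sketch of one common formulation, in which a sigmoid "depth gate" lets the lower layer's cell state flow directly into the current layer's cell state. The function name, argument names, and the assumption that the gate activations are already computed are all hypothetical and chosen for illustration; this is not necessarily the exact variant used in the paper.

```python
import math


def sigmoid(x):
    """Logistic function used for the (assumed) depth gate."""
    return 1.0 / (1.0 + math.exp(-x))


def highway_cell_update(c_lower, c_prev, i_gate, f_gate, g_cand, d_pre):
    """One time step of a highway-style cell update for a single layer.

    All arguments are equal-length lists of floats (one entry per cell):
      c_lower : cell state of the layer below at the current time step
      c_prev  : this layer's own cell state at the previous time step
      i_gate  : input gate activations, already in (0, 1)
      f_gate  : forget gate activations, already in (0, 1)
      g_cand  : candidate cell input (e.g. tanh output)
      d_pre   : depth-gate pre-activations (assumed precomputed elsewhere)
    """
    new_cell = []
    for cl, cp, i, f, g, dp in zip(c_lower, c_prev, i_gate, f_gate, g_cand, d_pre):
        d = sigmoid(dp)  # depth gate: how much lower-layer memory to carry up
        # Standard LSTM update (f * cp + i * g) plus the highway term (d * cl).
        new_cell.append(d * cl + f * cp + i * g)
    return new_cell
```

When the depth gate is closed (very negative `d_pre`), the update reduces to the ordinary LSTM cell update; when it is open, the lower layer's memory is added directly, which gives gradients a shorter path through a deep stack.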

References

[1] G. Hinton, L. Deng, D. Yu, et al., “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” Signal Processing Magazine, IEEE, Vol. 29, No. 6, pp. 82-97, 2012.
[2] H.A. Bourlard and N. Morgan, “Connectionist speech recognition: A hybrid approach,” Springer Science and Business Media, 2012.
[3] G.E. Dahl, D. Yu, L. Deng, et al., “Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition,” IEEE Transactions on Audio, Speech and Language Processing, Vol. 20, No. 1, pp. 30-42, 2012.
[4] F. Seide, G. Li, and D. Yu, “Conversational speech transcription using context-dependent deep neural networks,” Proc. Annual Conference of International Speech Communication Association (Interspeech), pp. 437-440, 2011.
[5] P. Swietojanski, A. Ghoshal, and S. Renals, “Convolutional neural networks for distant speech recognition,” Signal Processing Letters, IEEE, Vol. 21, No. 9, pp. 1120-1124, 2014.
[6] J. Xu, J. Pan, and Y. Yan, “Agglutinative language speech recognition using automatic allophone deriving,” Chinese Journal of Electronics, Vol. 25, No. 2, pp. 328-333, 2016.
[7] W. Jiang, P. Liu, and F. Wen, “Speech magnitude spectrum reconstruction from MFCCs using deep neural network,” Chinese Journal of Electronics, Vol. 27, No. 2, pp. 393-398, 2018.
[8] H. Zhang, Q. Fu, and Y. Yan, “Speech Enhancement Using Compact Microphone Array and Applications in Distant Speech Acquisition,” Chinese Journal of Electronics, Vol. 18, No. 3, pp. 481-486, 2009.
[9] Y. Xie, J. Huang, and Y. He, “One Dictionary vs. Two Dictionaries in Sparse Coding Based Denoising,” Chinese Journal of Electronics, Vol. 26, No. 2, pp. 367-371, 2017.
[10] A. Graves, A. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” Proc. International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013.
[11] H. Zen and H. Sak, “Unidirectional long short-term memory recurrent neural network with recurrent output layer for low-latency speech synthesis,” Proc. International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015.
[12] H. Sak, A. Senior, and F. Beaufays, “Long short-term memory recurrent neural network architectures for large scale acoustic modeling,” Annual Conference of the International Speech Communication Association (Interspeech), 2014.
[13] Y. Zhang, G. Chen, D. Yu, et al., “Highway long short-term memory RNNs for distant speech recognition,” Proc. International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016.
[14] Y. Bengio, P. Simard, and P. Frasconi, “Learning long-term dependencies with gradient descent is difficult,” IEEE Transactions on Neural Networks, Vol. 5, No. 2, pp. 157-166, 1994.
[15] L. Lu and S. Renals, “Small-footprint deep neural networks with highway connections for speech recognition,” IEEE Transactions on Audio, Speech and Language Processing, Vol. 25, No. 7, pp. 1502-1511, 2017.
[16] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, Vol. 9, No. 8, pp. 1735-1780, 1997.
[17] H. Sak, A. Senior, and F. Beaufays, “Long short-term memory based recurrent neural network architectures for large vocabulary speech recognition,” Feb. 2014. Available: http://arxiv.org/abs/1402.1128.
[18] C.Y. Lee, S. Xie, P. Gallagher, et al., “Deeply-supervised nets,” Artificial Intelligence and Statistics, 2015.
[19] Y. Bengio, P. Lamblin, D. Popovici, et al., “Greedy layer-wise training of deep networks,” Proc. NIPS, 2007, Vol. 19, pp. 153.
[20] G.E. Hinton and R.R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, Vol. 313, No. 5786, pp. 504-507, 2006.
[21] R.K. Srivastava, K. Greff, and J. Schmidhuber, “Training very deep networks,” Proc. NIPS, 2015.
[22] D. Povey, V. Peddinti, D. Galvez, et al., “Purely sequence-trained neural networks for ASR based on lattice-free MMI,” Annual Conference of International Speech Communication Association (Interspeech), 2016.
[23] K. Vesely, A. Ghoshal, L. Burget, et al., “Sequence-discriminative training of deep neural networks,” Annual Conference of International Speech Communication Association (Interspeech), pp. 2345-2349, 2013.
[24] G. Saon, H. Soltau, D. Nahamoo, et al., “Speaker adaptation of neural network acoustic models using i-vectors,” Proc. IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 55-59, 2013.
