Using Highway Connections to Enable Deep Small-footprint LSTM-RNNs for Speech Recognition
  • Title: Using Highway Connections to Enable Deep Small-footprint LSTM-RNNs for Speech Recognition
  • Authors: CHENG Gaofeng; LI Xin; YAN Yonghong
  • Affiliations: Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics; University of Chinese Academy of Sciences
  • Keywords: Long short-term memory; Highway connections; Small-footprint; Speech recognition
  • Journal: Chinese Journal of Electronics (journal code: EDZX)
  • Publication date: 2019-01-15
  • Year: 2019
  • Volume: 28; Issue: 1
  • Funding: supported by the National Key Research and Development Program (No. 2016YFB0801203, No. 2016YFB0801200) and the National Natural Science Foundation of China (No. 11590774, No. 11590770)
  • Language: English
  • Pages: 111-116 (6 pages)
  • CN: 10-1284/TN
  • Record ID: EDZX201901015
Abstract
Long short-term memory RNNs (LSTM-RNNs) have shown great success in the automatic speech recognition (ASR) field and have become the state-of-the-art acoustic model for time-sequence modeling tasks. However, it is still difficult to train deep LSTM-RNNs while keeping the parameter count small. We use highway connections between the memory cells in adjacent layers to train small-footprint highway LSTM-RNNs (HLSTM-RNNs), which are deeper and thinner than conventional LSTM-RNNs. Experiments on the Switchboard (SWBD) corpus show that we can train thinner and deeper HLSTM-RNNs with fewer parameters than conventional 3-layer LSTM-RNNs while achieving a lower word error rate (WER). Compared with their small-footprint LSTM-RNN counterparts, the small-footprint HLSTM-RNNs show a greater reduction in WER.
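The highway connection referred to in the abstract adds a gated path between the memory cells of adjacent LSTM layers, so that the lower layer's cell state can flow directly into the layer above; this is what lets deeper, thinner stacks stay trainable. The following is a minimal PyTorch-style sketch of that idea, not the authors' implementation: the class name HighwayLSTMCell, the gate layout, and the assumption that all layers share the same hidden size are illustrative choices.

    import torch
    import torch.nn as nn

    class HighwayLSTMCell(nn.Module):
        """One layer of a highway LSTM (illustrative sketch).

        A depth (highway) gate d lets the current cell state of the layer
        below flow directly into this layer's cell state.
        """

        def __init__(self, input_size, hidden_size):
            super().__init__()
            self.hidden_size = hidden_size
            # Standard LSTM gates: input, forget, cell candidate, output.
            self.gates = nn.Linear(input_size + hidden_size, 4 * hidden_size)
            # Depth gate, conditioned on the input, this layer's previous
            # cell state, and the lower layer's current cell state.
            self.depth_gate = nn.Linear(input_size + 2 * hidden_size, hidden_size)

        def forward(self, x, state, c_lower=None):
            h_prev, c_prev = state
            z = torch.cat([x, h_prev], dim=-1)
            i, f, g, o = self.gates(z).chunk(4, dim=-1)
            i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
            g = torch.tanh(g)
            c = f * c_prev + i * g
            if c_lower is not None:
                # Highway connection between memory cells of adjacent layers.
                d = torch.sigmoid(
                    self.depth_gate(torch.cat([x, c_prev, c_lower], dim=-1)))
                c = c + d * c_lower
            h = o * torch.tanh(c)
            return h, c

    # Usage sketch: two thin stacked layers at a single time step; the upper
    # layer receives the lower layer's cell state through the highway gate.
    x = torch.randn(8, 40)  # (batch, feature)
    lower = HighwayLSTMCell(40, 128)
    upper = HighwayLSTMCell(128, 128)
    h1, c1 = lower(x, (torch.zeros(8, 128), torch.zeros(8, 128)))
    h2, c2 = upper(h1, (torch.zeros(8, 128), torch.zeros(8, 128)), c_lower=c1)

Because the extra parameters per layer amount to a single hidden-sized gate, stacking more such layers while shrinking the hidden size is how a deeper network can keep its overall parameter count small, which is the trade-off the abstract describes.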
