Attention-Based Acoustic Model Combining Bottleneck Features (结合瓶颈特征的注意力声学模型)
  • Authors: LONG Xing-yan; QU Dan; ZHANG Wen-lin
  • Affiliation: Information System Engineering College, PLA Information Engineering University
  • Keywords: acoustic model; attention model; bottleneck feature; deep belief network
  • Journal: Computer Science (计算机科学); CNKI journal code: JSJA
  • Publication date: 2019-01-15
  • Year: 2019; Volume: 46; Issue: 01
  • Pages: 267-271 (5 pages)
  • CN: 50-1075/TP; ISSN: 1002-137X
  • Funding: National Natural Science Foundation of China (61673395, 61403415); Natural Science Foundation of Henan Province (162300410331)
  • Language: Chinese
  • Record ID: JSJA201901041
Abstract
Attention-based sequence-to-sequence acoustic models are currently a research hotspot in speech recognition, but they suffer from long training times and poor robustness. To address these problems, this paper proposes an attention-based acoustic model that incorporates bottleneck features. The model consists of two parts: a bottleneck feature extraction network based on a Deep Belief Network (DBN), and an attention-based sequence-to-sequence model. The DBN introduces prior information from a conventional acoustic model, which speeds up model convergence and makes the bottleneck features more robust and discriminative; the attention model then uses the temporal information of the speech feature sequence to compute the posterior probability of the phoneme sequence. On top of the baseline system, training time is reduced by decreasing the number of recurrent-neural-network layers in the attention model, and recognition accuracy is optimized by tuning the number of units in the input and bottleneck layers of the feature extraction network. Experiments on the TIMIT corpus show that the model reduces the phoneme error rate on the core test set to 17.80%, shortens the average time per training iteration by 52%, and cuts the number of training iterations from 139 to 89.
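
To make the described architecture concrete, below is a minimal PyTorch sketch of a bottleneck feature extractor feeding an attention-based sequence-to-sequence phoneme recognizer. It is an illustration, not the paper's implementation: the DBN pretraining is approximated by a plain MLP with a narrow bottleneck layer, and every dimension (440-dim spliced input frames, a 40-unit bottleneck, 39 phone classes, 256-unit encoder/decoder) is an assumed value rather than a setting reported in the paper.

    # Minimal sketch of the abstract's architecture: a bottleneck feature
    # extractor feeding an attention-based sequence-to-sequence model.
    # All layer sizes below are illustrative assumptions.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class BottleneckExtractor(nn.Module):
        """MLP stand-in for the DBN bottleneck network.

        In the paper the hidden layers are pretrained as a DBN on
        frame-level acoustic features so that the narrow bottleneck
        layer yields robust, discriminative features; that pretraining
        step is omitted here.
        """
        def __init__(self, in_dim=440, hidden_dim=1024, bottleneck_dim=40):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(in_dim, hidden_dim), nn.Sigmoid(),
                nn.Linear(hidden_dim, hidden_dim), nn.Sigmoid(),
                nn.Linear(hidden_dim, bottleneck_dim),  # bottleneck layer
            )

        def forward(self, frames):               # (batch, time, in_dim)
            return self.net(frames)              # (batch, time, bottleneck_dim)

    class AttentionDecoder(nn.Module):
        """Content-based (Bahdanau-style) attention decoder.

        At each output step it attends over the encoded feature
        sequence and emits a posterior distribution over phonemes.
        """
        def __init__(self, enc_dim, dec_dim, n_phones):
            super().__init__()
            self.score_enc = nn.Linear(enc_dim, dec_dim, bias=False)
            self.score_dec = nn.Linear(dec_dim, dec_dim, bias=False)
            self.score_v = nn.Linear(dec_dim, 1, bias=False)
            self.cell = nn.GRUCell(enc_dim + n_phones, dec_dim)
            self.out = nn.Linear(dec_dim, n_phones)
            self.n_phones = n_phones

        def forward(self, enc, n_steps):
            batch = enc.size(0)
            state = enc.new_zeros(batch, self.cell.hidden_size)
            prev = enc.new_zeros(batch, self.n_phones)  # previous phoneme distribution
            keys = self.score_enc(enc)                  # precompute (batch, T, dec_dim)
            posteriors = []
            for _ in range(n_steps):
                # attention weights over encoder time steps
                energy = self.score_v(torch.tanh(keys + self.score_dec(state).unsqueeze(1)))
                alpha = F.softmax(energy.squeeze(-1), dim=1)       # (batch, T)
                context = torch.bmm(alpha.unsqueeze(1), enc).squeeze(1)
                state = self.cell(torch.cat([context, prev], dim=-1), state)
                post = F.softmax(self.out(state), dim=-1)          # phoneme posterior
                prev = post
                posteriors.append(post)
            return torch.stack(posteriors, dim=1)       # (batch, n_steps, n_phones)

    class BottleneckAttentionASR(nn.Module):
        def __init__(self, in_dim=440, bottleneck_dim=40, enc_dim=256, n_phones=39):
            super().__init__()
            self.extractor = BottleneckExtractor(in_dim, bottleneck_dim=bottleneck_dim)
            # a single bidirectional recurrent layer: the paper reduces the
            # number of recurrent layers to shorten per-iteration training time
            self.encoder = nn.GRU(bottleneck_dim, enc_dim // 2, num_layers=1,
                                  batch_first=True, bidirectional=True)
            self.decoder = AttentionDecoder(enc_dim, dec_dim=256, n_phones=n_phones)

        def forward(self, frames, n_steps):
            enc, _ = self.encoder(self.extractor(frames))
            return self.decoder(enc, n_steps)

    # toy usage: a batch of 2 utterances, 100 spliced frames of dimension 440
    model = BottleneckAttentionASR()
    posteriors = model(torch.randn(2, 100, 440), n_steps=20)
    print(posteriors.shape)  # torch.Size([2, 20, 39])

During training one would typically replace the soft feedback of the previous posterior with teacher-forced one-hot phoneme labels and minimize cross-entropy against the reference phone sequence; at test time the posterior at each step is decoded into the phoneme string scored by the phoneme error rate.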
