Abstract
To further improve the performance of deep-neural-network-based speech enhancement, a method built on a convolutional gated recurrent neural network is proposed, addressing the difficulty that convolutional neural networks alone have in modeling long-term dependencies in noisy speech. The method first extracts local features from the noisy speech with a convolutional neural network, then uses a gated recurrent neural network to correlate those local features across different time periods. By combining the complementary characteristics of the two networks, it makes better use of the contextual information in noisy speech during enhancement. Experimental results show that the method effectively improves speech enhancement performance under unknown noise conditions, and that the enhanced speech has better quality and intelligibility.
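The pipeline described above (a convolutional front-end that extracts local features from each spectral frame, followed by a gated recurrent layer that links those features across time and emits a time-frequency mask) can be sketched as a minimal NumPy toy. This is an illustrative sketch only: the single convolution layer, all layer sizes, the GRU gate convention, and the sigmoid mask output are assumptions for demonstration, not the paper's actual configuration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv1d_features(frame, kernels):
    """Valid 1-D convolution of one spectral frame with each kernel:
    local feature extraction along the frequency axis."""
    k = kernels.shape[1]
    out = np.array([
        [np.dot(frame[i:i + k], w) for i in range(len(frame) - k + 1)]
        for w in kernels
    ])
    return np.tanh(out).ravel()  # flatten the channel x position feature map

def gru_step(x, h, p):
    """One GRU step: update gate z, reset gate r, candidate state."""
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h)
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h)
    h_tilde = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h))
    return (1 - z) * h + z * h_tilde

def crn_enhance(spectrogram, kernels, p, Wo):
    """CNN front-end per frame, GRU across frames, sigmoid mask output."""
    T, F = spectrogram.shape
    h = np.zeros(p["Uz"].shape[0])
    masks = []
    for t in range(T):
        feat = conv1d_features(spectrogram[t], kernels)  # local features
        h = gru_step(feat, h, p)                         # temporal context
        masks.append(sigmoid(Wo @ h))                    # mask in (0, 1)
    return np.stack(masks) * spectrogram                 # apply mask

# Toy dimensions: 10 frames, 16 frequency bins, kernel width 5,
# 4 conv channels, hidden size 8 (all hypothetical).
rng = np.random.default_rng(0)
T, F, K, C, H = 10, 16, 5, 4, 8
feat_dim = C * (F - K + 1)
kernels = rng.standard_normal((C, K)) * 0.1
p = {name: rng.standard_normal(shape) * 0.1 for name, shape in
     [("Wz", (H, feat_dim)), ("Uz", (H, H)),
      ("Wr", (H, feat_dim)), ("Ur", (H, H)),
      ("Wh", (H, feat_dim)), ("Uh", (H, H))]}
Wo = rng.standard_normal((F, H)) * 0.1
noisy = np.abs(rng.standard_normal((T, F)))  # stand-in magnitude spectrogram
enhanced = crn_enhance(noisy, kernels, p, Wo)
print(enhanced.shape)  # (10, 16)
```

Because the mask is a sigmoid, every time-frequency bin is attenuated rather than amplified; a trained model would learn kernels and gate weights so that speech-dominated bins keep masks near 1 and noise-dominated bins near 0.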