Abstract
In the training of speech enhancement models based on deep neural networks (DNNs), the mean square error is generally adopted as the cost function, which is not optimized for the speech enhancement problem. To address this problem, two aspects are considered: the correlation between the network outputs of adjacent frames, and the presence of speech in each time-frequency unit. By correlating the network outputs of adjacent frames in the cost function and designing a perceptual coefficient that reflects the speech presence in each time-frequency unit, a DNN-based speech enhancement method with perceptual joint optimization is proposed. Experimental results show that, compared with the speech enhancement method based on the mean square error, the proposed method significantly improves the quality and intelligibility of the enhanced speech and achieves better speech enhancement performance.
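The abstract does not give the exact formulas, so the following is only a minimal sketch of what such a joint cost might look like. The speech-presence weight (approximated here by a local speech-to-noise energy ratio) and the first-order inter-frame difference term are illustrative assumptions, not the paper's actual definitions; `alpha` and `beta` are hypothetical trade-off parameters:

```python
import numpy as np

def perceptual_joint_loss(pred, target, noise, alpha=1.0, beta=0.5, eps=1e-8):
    """Sketch of a perceptually weighted joint cost (hypothetical formulation).

    pred, target, noise: (T, F) arrays of predicted/clean/noise spectral
    features, with T frames and F frequency bins.
    """
    # Perceptual coefficient: weight each time-frequency unit by its speech
    # presence, approximated here by the local speech-to-noise energy ratio
    # (an assumption; the paper defines its own perceptual coefficient).
    speech_presence = target**2 / (target**2 + noise**2 + eps)
    mse_term = np.mean(speech_presence * (pred - target) ** 2)

    # Inter-frame term: penalize mismatch between the frame-to-frame dynamics
    # of the network output and those of the clean target, which couples the
    # outputs of adjacent frames in the cost (again an illustrative choice).
    pred_delta = pred[1:] - pred[:-1]
    target_delta = target[1:] - target[:-1]
    delta_term = np.mean((pred_delta - target_delta) ** 2)

    return alpha * mse_term + beta * delta_term
```

With this form, a prediction that matches the clean target exactly yields zero cost, and errors in speech-dominated units or in the temporal trajectory of the output are penalized more heavily than a plain mean square error would.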