Abstract
To recover clean speech from a noisy signal, a single-channel speech enhancement algorithm based on gender-dependent models is proposed. In the training stage, two gender-dependent deep neural network-nonnegative matrix factorization (DNN-NMF) models are trained on gender-specific data to estimate the NMF weights (activations). In the test stage, an algorithm based on NMF with a group sparsity penalty first identifies the gender of the speaker in the test utterance; the corresponding DNN-NMF model then estimates the activations, which are combined with the pre-trained dictionaries to enhance the speech. Experimental results show that the proposed algorithm achieves stronger noise suppression and better speech quality than several NMF-based and DNN-based algorithms.
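The abstract does not give the exact formulation of the gender-identification step, so the following is only a minimal sketch of one plausible reading. It assumes that a nonnegative dictionary has already been trained for each gender, that the activations are fitted with multiplicative updates for the Kullback-Leibler divergence, that the group sparsity penalty has the log-of-L1 form used in universal speech models, and that the gender whose atoms carry the most activation is selected. The function names `group_sparse_activations` and `detect_gender` and all parameter values are hypothetical.

```python
import numpy as np


def group_sparse_activations(V, W, groups, lam=1.0, n_iter=100, eps=1e-9, seed=0):
    """Fit H >= 0 so that V ~= W @ H with the dictionary W held fixed.

    Assumed objective: KL(V || W H) + lam * sum_g log(eps + ||H_g||_1),
    where each group g holds the rows of H belonging to one gender.
    """
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[1], V.shape[1])) + 0.1   # strictly positive start
    col_sums = W.T @ np.ones_like(V)                  # constant KL denominator
    for _ in range(n_iter):
        WH = W @ H + eps
        numer = W.T @ (V / WH)
        # Gradient of the log-L1 penalty: one shared value per group.
        penalty = np.zeros_like(H)
        for idx in groups.values():
            penalty[idx, :] = lam / (eps + H[idx, :].sum())
        H *= numer / (col_sums + penalty)             # heuristic multiplicative update
    return H


def detect_gender(V, dictionaries, lam=1.0, n_iter=100):
    """Return the label whose dictionary atoms receive the most activation."""
    names = list(dictionaries)
    W = np.hstack([dictionaries[n] for n in names])   # concatenated dictionary
    groups, start = {}, 0
    for n in names:
        k = dictionaries[n].shape[1]
        groups[n] = slice(start, start + k)
        start += k
    H = group_sparse_activations(V, W, groups, lam, n_iter)
    energy = {n: float(H[groups[n], :].sum()) for n in names}
    return max(energy, key=energy.get), energy
```

Here `V` would be the magnitude spectrogram of the test utterance (frequency bins by frames) and `dictionaries` a mapping such as `{"male": W_m, "female": W_f}`. In the pipeline described in the abstract, the returned label would then select which gender-dependent DNN-NMF model estimates the activations used for enhancement.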