Single-channel speech enhancement using gender-related deep neural network and non-negative matrix factorization models

English title: Single-channel speech enhancement based on gender-related deep neural networks and non-negative matrix factorization models
Authors (Chinese): 李煦; 王子腾; 王晓飞; 付强; 颜永红
Authors (English): LI Xu; WANG Ziteng; WANG Xiaofei; FU Qiang; YAN Yonghong
CNKI journal code: XIBA
Journal (English): Acta Acustica
Affiliations: Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Xinjiang Laboratory of Minority Speech and Language Information Processing, Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences
Publication date: 2019-03-15
Publisher: 声学学报 (Acta Acustica)
Year: 2019
Volume: 44
Funding: Supported by the National Natural Science Foundation of China (11461141004, 61271426, U1536117, 11504406, 11590770-4); the Strategic Priority Research Program of the Chinese Academy of Sciences (XDA06030100, XDA06030500, XDA06040603); the National High Technology Research and Development Program of China (863 Program) (2015AA016306); the National Basic Research Program of China (973 Program) (2013CB329302); and the Major Science and Technology Project of the Xinjiang Uygur Autonomous Region (201230118-3)
Language: Chinese
CNKI file number: XIBA201902009
Number of pages: 10
Issue: 02
CN: 11-2065/O4
Pages: 79-88

Abstract

To recover clean speech from a noisy signal, a single-channel speech enhancement algorithm based on gender-related models is proposed. In the training stage, deep neural networks (DNN) and non-negative matrix factorization (NMF) are combined to train two gender-related DNN-NMF models on gender-specific training data; each model estimates the activations (weights) of the NMF decomposition. In the test stage, an algorithm based on NMF with a group sparsity penalty is proposed to identify the gender of the speaker in the test signal. The corresponding DNN-NMF model then estimates the activations, which are combined with the pre-trained dictionaries to enhance the speech. Experimental results show that the proposed algorithm outperforms several NMF-based and DNN-based methods in both noise suppression and speech quality, suppressing more noise without degrading the speech.
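The abstract names two test-time steps: deciding the speaker's gender with a group-sparse NMF, then enhancing the signal with the matching gender-specific model and the pre-trained dictionaries. The Python (NumPy) sketch below illustrates that pipeline under assumptions the abstract does not confirm: inputs are non-negative magnitude spectrograms; the male, female and noise dictionaries are already trained; activations are fitted with Lee-Seung multiplicative updates for the KL divergence; the group sparsity penalty takes the assumed form `lam * sum_g log(eps_g + sum(H[g]))` over the male and female groups; and plain NMF activation fitting stands in for the paper's DNN-NMF estimator. The function names `kl_nmf_activations`, `identify_gender` and `enhance`, the weight `lam`, and the toy band-limited dictionaries are all hypothetical.

```python
import numpy as np

EPS = 1e-12  # guards against division by zero


def kl_nmf_activations(V, W, groups=None, lam=0.0, eps_g=1e-3, n_iter=200, seed=0):
    """Fit H >= 0 so that V ~= W @ H with the dictionary W held fixed.

    Lee-Seung multiplicative updates for the KL divergence. If `groups`
    (a list of row slices of H) is given and lam > 0, an ASSUMED log
    group-sparsity penalty lam * sum_g log(eps_g + sum(H[g])) is added.
    """
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[1], V.shape[1])) + 0.1
    ones = np.ones_like(V)
    for _ in range(n_iter):
        numer = W.T @ (V / (W @ H + EPS))
        denom = W.T @ ones
        if groups is not None and lam > 0:
            for g in groups:
                # The penalty gradient is identical for every entry of group g
                denom[g] += lam / (eps_g + H[g].sum())
        H *= numer / (denom + EPS)
    return H


def identify_gender(V, W_male, W_female, W_noise, lam=0.5):
    """Pick the gender whose dictionary carries more activation (lam is illustrative)."""
    W = np.hstack([W_male, W_female, W_noise])
    k_m, k_f = W_male.shape[1], W_female.shape[1]
    groups = [slice(0, k_m), slice(k_m, k_m + k_f)]  # noise atoms are not penalized
    H = kl_nmf_activations(V, W, groups=groups, lam=lam)
    return "male" if H[groups[0]].sum() >= H[groups[1]].sum() else "female"


def enhance(V, W_speech, W_noise):
    """NMF activations (stand-in for the paper's DNN estimate) + Wiener-like soft mask."""
    k = W_speech.shape[1]
    H = kl_nmf_activations(V, np.hstack([W_speech, W_noise]))
    S = W_speech @ H[:k]  # modeled speech magnitude
    N = W_noise @ H[k:]   # modeled noise magnitude
    return S / (S + N + EPS) * V


# Hypothetical toy data: each dictionary occupies its own band of 12 bins
rng = np.random.default_rng(1)
F, K, T = 12, 3, 20


def band_dict(lo, hi):
    W = np.zeros((F, K))
    W[lo:hi] = rng.random((hi - lo, K)) + 0.1
    return W / W.sum(axis=0, keepdims=True)  # unit L1-norm atoms


W_male, W_female, W_noise = band_dict(0, 4), band_dict(4, 8), band_dict(8, 12)
clean = W_female @ (rng.random((K, T)) + 0.1)
noisy = clean + W_noise @ (rng.random((K, T)) + 0.1)

gender = identify_gender(noisy, W_male, W_female, W_noise)
W_speech = W_female if gender == "female" else W_male
enhanced = enhance(noisy, W_speech, W_noise)
print(gender, np.abs(enhanced - clean).max())
```

Because the toy dictionaries occupy disjoint frequency bands, the gender decision and the mask are exact by construction here. Real male and female dictionaries overlap heavily; in that case the log penalty is what drives one whole group of activations toward zero, and the illustrative value of `lam` would need tuning.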
