Research and application of deep recurrent neural networks based voiceprint recognition

Original title: 基于深度循环网络的声纹识别方法研究及应用
Authors: Yu Lingfei (余玲飞); Liu Qiang (刘强)
Affiliations: Hangzhou College of Commerce, Zhejiang Gongshang University; School of Computer Science & Engineering, University of Electronic Science & Technology of China
Keywords: voiceprint recognition; deep RNN; convolutional neural network (CNN); spectrogram
Journal: Application Research of Computers (计算机应用研究; journal code JSYJ)
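Publication date: 2018-02-08 17:15
Publisher: Application Research of Computers (计算机应用研究)
Year: 2019
Volume/issue: v.36; No.327
Funding: National Natural Science Foundation of China (61370204); Zhejiang Provincial Natural Science Foundation (LQ16F02001)
Language: Chinese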
Record ID: JSYJ201901036
Page count: 6
Issue number: 01
CN: 51-1196/TP
Pages: 159-164

Abstract

Voiceprint recognition, which identifies a speaker from his or her voice, is one of the most popular biometric identification technologies. This paper proposes CDRNN, a voiceprint recognition scheme that combines a convolutional neural network (CNN) and a deep recurrent neural network (deep RNN) in a unified model and is suitable for mobile terminals. CDRNN first converts the speaker's raw speech signal, through a series of processing steps, into a two-dimensional spectrogram. Because CNNs are good at processing images, a CNN then extracts speaker-specific features from the spectrogram. Finally, these features are fed into the deep RNN, which completes the recognition and determines the speaker's identity. Experimental results show that CDRNN achieves higher recognition accuracy than other schemes such as GMM-UBM and a DNN-based approach.
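
The data flow described in the abstract (speech signal, then spectrogram, then CNN features, then deep RNN, then speaker identity) can be illustrated with a minimal NumPy sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the function names (`log_spectrogram`, `conv_features`, `deep_rnn_classify`), the frame length, hop size, kernel size, layer widths, the random untrained weights, and the synthetic test tone. A plain stacked tanh RNN stands in for whatever recurrent cell the authors used, so the output probabilities carry no meaning beyond showing the shapes at each stage.

```python
import numpy as np

def log_spectrogram(signal, frame_len=400, hop=160, n_fft=512):
    """Log-magnitude spectrogram, shape (n_frames, n_fft // 2 + 1)."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Index matrix that slices the signal into overlapping frames.
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    mag = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))
    return np.log(mag + 1e-8)

def conv_features(spec, kernels):
    """One CNN-style layer: valid 2-D cross-correlation + ReLU,
    then mean-pooling over frequency.
    spec: (T, F); kernels: (C, kh, kw). Returns (T - kh + 1, C)."""
    windows = np.lib.stride_tricks.sliding_window_view(spec, kernels.shape[1:])
    maps = np.einsum("tfhw,chw->tfc", windows, kernels)
    return np.maximum(maps, 0.0).mean(axis=1)

def deep_rnn_classify(features, layers, w_out):
    """Stacked tanh RNN over time; softmax over speakers from the last
    hidden state of the top layer. layers: list of (w_in, w_rec)."""
    seq = features
    for w_in, w_rec in layers:
        h = np.zeros(w_rec.shape[0])
        outputs = []
        for x in seq:
            h = np.tanh(x @ w_in + h @ w_rec)
            outputs.append(h)
        seq = np.stack(outputs)  # this layer's outputs feed the next layer
    logits = seq[-1] @ w_out
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

# Hypothetical input: one second of a noisy 220 Hz tone sampled at 16 kHz.
rng = np.random.default_rng(0)
sr = 16000
t = np.arange(sr) / sr
signal = np.sin(2 * np.pi * 220 * t) + 0.1 * rng.standard_normal(sr)

spec = log_spectrogram(signal)                    # shape (98, 257)
kernels = 0.1 * rng.standard_normal((8, 3, 3))    # 8 untrained 3x3 filters
feats = conv_features(spec, kernels)              # shape (96, 8)

n_hidden, n_speakers = 16, 4                      # illustrative sizes
layers = [
    (0.1 * rng.standard_normal((8, n_hidden)),
     0.1 * rng.standard_normal((n_hidden, n_hidden))),
    (0.1 * rng.standard_normal((n_hidden, n_hidden)),
     0.1 * rng.standard_normal((n_hidden, n_hidden))),
]
w_out = 0.1 * rng.standard_normal((n_hidden, n_speakers))
probs = deep_rnn_classify(feats, layers, w_out)   # shape (4,), sums to 1
print(spec.shape, feats.shape, probs.shape)
```

In this sketch the time axis of the spectrogram is preserved through the convolutional stage, so the recurrent layers see one feature vector per frame. A trained system would replace the random weights with parameters learned from labeled speech.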
