Abstract
As a subfield of Multimedia Information Retrieval (MIR), Singer IDentification (SID) is still an open research problem. On one hand, SID cannot easily achieve high accuracy because the singing voice is difficult to model and is usually disturbed by background instrumental music. On the other hand, the performance of conventional machine learning methods is limited by the scale of the training dataset. This study proposes a new deep learning approach based on Long Short-Term Memory (LSTM) networks and Mel-Frequency Cepstral Coefficient (MFCC) features to identify the singer of a song in large datasets. The results of this study indicate that an LSTM can learn a representation of the temporal relationships between successive MFCC frames. The experimental results show that the proposed method achieves higher accuracy for Chinese SID on the MIR-1K dataset than traditional approaches.
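The pipeline described above (a sequence of MFCC frames classified by an LSTM over their temporal relationships) can be sketched as follows. This is a minimal PyTorch illustration, not the paper's actual architecture: the hidden size, number of MFCC coefficients, and use of the final hidden state for classification are all illustrative assumptions, and the random tensor stands in for MFCC features extracted from real audio. The output dimension of 19 matches the number of singers in MIR-1K.

```python
import torch
import torch.nn as nn

class SingerLSTM(nn.Module):
    """Illustrative LSTM classifier over MFCC frame sequences."""

    def __init__(self, n_mfcc=20, hidden=64, n_singers=19):
        super().__init__()
        # LSTM consumes one MFCC frame per time step.
        self.lstm = nn.LSTM(input_size=n_mfcc, hidden_size=hidden,
                            batch_first=True)
        # Final hidden state is mapped to per-singer logits.
        self.fc = nn.Linear(hidden, n_singers)

    def forward(self, x):            # x: (batch, frames, n_mfcc)
        _, (h_n, _) = self.lstm(x)   # h_n: (num_layers, batch, hidden)
        return self.fc(h_n[-1])      # logits: (batch, n_singers)

model = SingerLSTM()
mfcc = torch.randn(4, 100, 20)       # 4 clips, 100 MFCC frames of 20 coefficients
logits = model(mfcc)
print(logits.shape)                  # torch.Size([4, 19])
```

In practice the MFCC tensor would come from a feature extractor applied to the vocal track, and the logits would be trained with a cross-entropy loss over singer labels.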