Speaker Recognition of Chinese Whispered Speech Based on JFA
Abstract
Speaker recognition of whispered speech has practical value in fields such as private communication in public places, identity verification at secure sites, criminal identification, telephone network inquiry, and telephone banking. It is a relatively new research topic, and many problems remain to be solved.
     Owing to the special production mechanism of whispering, and because whispered conversations are often conducted over mobile phones, whispered-speech speaker recognition is especially affected by the speaker's phonation state, health, psychological factors, and the channel environment. Consequently, speaker recognition systems built on normal speech are essentially unsuitable for whispered speech, and their performance degrades severely.
     Existing adaptive compensation methods lump speaker variation and channel-environment variation together without distinction, which inevitably harms the recognition of whispered speech. It is therefore necessary to build a recognition model suited to the characteristics of whispered speech in order to achieve text-independent whispered-speech speaker recognition. This thesis adopts joint factor analysis (JFA) to cope with the large variability of speaker characteristics in whispering, which is influenced by many factors; tailored to whispered speech, the method introduces two classes of variability factors: speaker factors and channel-environment factors.
     Given the difficulty of full joint factor analysis, this thesis proposes a simplified JFA method suited to whispered-speech speaker recognition. Its key feature is that the speaker space and the channel space are estimated separately, which greatly reduces algorithmic complexity and the amount of speech data required, and hence computation and run time. A recognition model based on the simplified JFA method is built, the corresponding algorithm is given, and text-independent speaker identification of whispered speech is implemented on this basis.
     The proposed simplified JFA model was tested under 8 different channel conditions. Experiments show that the model can identify whispered-speech speakers effectively even under channel mismatch; compared with existing GMM models using MAP, feature mapping, and speaker model synthesis (SMS), recognition accuracy improves markedly.
     In addition, the influence of the numbers of speaker factors and channel factors on the model's performance was studied. Experiments show that appropriately increasing the numbers of speaker and channel factors helps improve recognition accuracy, but both exhibit saturation: beyond a certain point, further increases yield almost no improvement in the model's performance.
Whispered speech is the mode of speech produced by speaking softly, without vibration of the vocal cords, to avoid being overheard. Whispering speaker recognition can be applied in several fields, such as private speech communication in public and forensic work.
     Since speaker recognition of whispered speech is at an early research stage, many models developed for normal speech are still used. However, most of them are unsuitable for whispered speech because of its characteristics.
     At present, the available adaptive compensation methods make no distinction between speaker-related factors (such as health and psychological state) and channel-environment factors, which definitely degrades the recognition of whispered speech.
     As for whispered speech: without vocal-cord vibration, it always has a low SNR. The locations and energies of the formants, as well as the auditory model, of whispered speech differ from those of normal speech. When whispering, the speaker's mental state is variable and susceptible. Hence, speaker recognition of whispered speech is more difficult than that of normal speech. The key concerns are how to decrease the influence of the speaking environment, especially variations of the speech channel, and how to remove mental or emotional effects.
     To address these characteristics, this paper presents a new approach to speaker identification of Chinese whispered speech, called simplified joint factor analysis. Its main idea is to estimate the speaker space and the channel space separately (in a decoupled manner), which removes the need for channel-labeled databases, simplifies the training procedure, and reduces both the computation and the amount of data required.
     Experiments are carried out on our own database. The corpus consists of 100 target speakers, 80 male and 20 female, each recorded over 8 typical channels. Compared with other recognition methods, such as MAP, Feature Mapping + MAP, and SMS, the proposed simplified JFA technique provides superior performance and a significant speedup in speaker identification of Chinese whispered speech. In particular, it greatly improves recognition accuracy when the enrollment and test conditions are mismatched.
     A study of the numbers of speaker factors and channel factors shows that increasing the number of factors appropriately can improve recognition accuracy effectively, but saturation occurs: continuing to increase the number of factors does not further improve the performance of the whispered-speech speaker recognition system.
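The JFA decomposition underlying the approach above can be illustrated with a toy numerical sketch. This is not the thesis's algorithm: all dimensions below are invented for illustration, both subspaces are assumed already known, and the factors are recovered by plain least squares, whereas real JFA systems work on GMM mean supervectors and estimate the factors with EM/MAP.

```python
import numpy as np

# Toy sketch of the JFA supervector model M = m + V@y + U@x, where
# V spans the speaker space, U the channel space, y holds the speaker
# factors, and x the channel factors.
rng = np.random.default_rng(0)

D = 60      # supervector dimension (toy value; real systems use
            # num_mixtures * feature_dim, e.g. 1024 * 39)
R_v = 5     # number of speaker factors
R_u = 3     # number of channel factors

m = rng.normal(size=D)            # universal background mean supervector
V = rng.normal(size=(D, R_v))     # low-rank speaker space
U = rng.normal(size=(D, R_u))     # low-rank channel space

# One synthetic session: a speaker (y_true) recorded over a channel (x_true).
y_true = rng.normal(size=R_v)
x_true = rng.normal(size=R_u)
M_obs = m + V @ y_true + U @ x_true

# Point-estimate both factor vectors jointly by least squares on the
# centered supervector.  (The simplified method in the abstract concerns
# how V and U themselves are estimated, separately rather than jointly;
# here both spaces are taken as given.)
W = np.hstack([V, U])
z_hat, *_ = np.linalg.lstsq(W, M_obs - m, rcond=None)
y_hat, x_hat = z_hat[:R_v], z_hat[R_v:]

# Speaker identification would score y_hat against enrolled speakers'
# factors, discarding the channel factors x_hat as nuisance variability.
err = np.linalg.norm(M_obs - (m + V @ y_hat + U @ x_hat))
print(round(float(err), 6))  # prints 0.0: exact recovery up to rounding
```

The point of the decomposition is visible here: channel variability is absorbed into `x_hat`, so comparing only the speaker factors gives a representation that is, by construction, insensitive to the channel term.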
