Research on Speaker Identification of Whispered Speech under Mismatched Channels
Abstract
Whispered speech is an auxiliary mode of human phonation that plays a fairly broad role in daily life, especially in identity verification in the financial, public-security, and judicial fields. Speakers often resort to whispering to keep information private.
     For this reason, whispered-speech speaker identification has been raised as a new research topic. Whispered speech is used mainly in mobile-phone conversation, where the signal is inevitably affected by channel distortion. When the channel conditions of training and testing differ greatly, the recognition rate of a traditional model drops sharply, so a robust channel-compensation algorithm is needed to strengthen the speaker identification system. To address this problem, this thesis carries out the following work:
     1. Whispered speech data from all channels are pooled to train a universal background model (UBM), and speaker models are then obtained from the UBM by maximum a posteriori (MAP) adaptation. Comparing the recognition rates of this model and a conventional GMM shows experimentally that the UBM-MAP model outperforms the ordinary GMM (a sketch of the MAP mean update follows this list).
     2. Joint factor analysis (JFA) is introduced into whispered speaker identification. In view of the characteristics of the whispered-speech database, the speaker and channel subspaces are estimated separately and the residual subspace is omitted. During identification, the speaker factors estimated from the training data are combined with the channel factors estimated from the test utterance, so that the speaker model continually adapts to the test channel; experiments show that the modified JFA improves recognition substantially. In addition, because JFA performs poorly on short test utterances, a hybrid compensation method is proposed that keeps the speaker factors in the model domain while applying the channel factors in the feature domain to compensate every frame feature vector, which is finer-grained than JFA. Experiments show that with HH-channel training the average 1 s and 2 s identification rates improve by 4.36% and 3.89%, and with EP-channel training by 4.14% and 2.64%, respectively (the JFA model and the per-frame compensation are sketched after this list).
     3. Exploiting the discriminative nature of the support vector machine (SVM), speaker supervectors are first fed into an SVM, but the resulting system performs worse than the UBM-MAP system. Speaker factor vectors are then used as the SVM inputs instead; because speaker factors are low-dimensional and close to linearly separable in an identification task, good accuracy is obtained. Three channel-compensation methods are further applied to remove redundancy, yielding recognition results comparable to JFA (a minimal SVM sketch follows this list).
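     Below is a minimal sketch of the relevance-MAP mean update behind point 1, in the standard UBM-MAP formulation; the relevance factor r and the statistics are the textbook quantities, not values taken from the thesis. With posteriors \gamma_c(t) of UBM mixture c for feature vectors o_t:

        n_c = \sum_t \gamma_c(t), \qquad E_c(o) = \frac{1}{n_c} \sum_t \gamma_c(t)\, o_t
        \hat{\mu}_c = \alpha_c\, E_c(o) + (1 - \alpha_c)\, \mu_c^{\mathrm{UBM}}, \qquad \alpha_c = \frac{n_c}{n_c + r}

     Mixtures that see many frames from a speaker (large n_c) move toward that speaker's data, while unseen mixtures stay at the UBM means.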
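     For point 2, the JFA supervector model with the residual term omitted, and the per-frame feature-domain compensation used in the hybrid method, can be sketched in common JFA notation (the symbols follow general usage and are not necessarily the thesis's own):

        M = m + V\,y + U\,x
        \hat{o}_t = o_t - \sum_c \gamma_c(t)\, U_c\, x

     Here m is the UBM mean supervector, V y the speaker term with speaker factors y estimated from the training data, U x the channel term with channel factors x estimated from the test utterance, U_c the rows of U belonging to mixture c, and \gamma_c(t) the posterior of mixture c for frame o_t. Subtracting the channel offset frame by frame compensates the features while leaving the speaker factors in the model domain untouched.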
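     A minimal Python sketch of the SVM back-end in point 3, assuming JFA speaker-factor vectors have already been extracted for each utterance; the file names and the scikit-learn classifier are illustrative choices, not taken from the thesis:

        import numpy as np
        from sklearn.svm import LinearSVC   # linear kernel suffices: speaker factors are low-dimensional

        # Hypothetical inputs: one speaker-factor vector per utterance plus its speaker label.
        x_train = np.load("speaker_factors_train.npy")    # shape (n_utterances, n_speaker_factors)
        lbl_train = np.load("speaker_labels_train.npy")   # integer speaker IDs
        x_test = np.load("speaker_factors_test.npy")

        clf = LinearSVC(C=1.0)                            # one-vs-rest linear SVMs over the factor vectors
        clf.fit(x_train, lbl_train)
        predicted_speaker = clf.predict(x_test)           # closed-set identification decision

     Any channel compensation of the factor vectors (the three methods compared in the thesis are not named here) would be applied to x_train and x_test before fitting.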
