Research on Whispered-Speech Speaker Identification Based on Instantaneous Frequency Estimation
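Abstract
Whispered speech is a special mode of human phonation that differs from normal speech both phonetically and physiologically. With the development of social and economic life, whispered speech has come into use in many settings and plays an increasingly important role in fields such as financial communications, public security and justice, and identity authentication.
     In practical speaker identification, whispered speech can serve as a supplement to normal speech and improve the performance of a speaker identification system. The characteristics of whispered speech make it harder to recognize than normal speech and more susceptible to channel interference, and traditional speech parameters are not robust when applied to it, so finding an effective whispered-speech parameter for speaker identification is a pressing problem. In addition, when a speaker identification system trained on normal speech is tested with whispered speech, its performance drops sharply. How to improve the accuracy of whispered-speech speaker identification when sufficient whispered training data cannot be obtained is therefore also worth investigating. To address these problems, this paper carries out the following work.
     1. Addressing the nonlinear phenomena in speech production, and following the formant modulation theory of speech production, the paper introduces the amplitude modulation-frequency modulation model (AM-FM model) of speech production, discusses in detail the application to speech of the Teager energy operator and the energy separation algorithm (DESA) based on this model, and compares them with other algorithms of similar function.
     2. Based on the theory of multiband demodulation analysis (MDA) for detecting multicomponent AM-FM signals and on the energy separation algorithm, the instantaneous amplitude and frequency of the speech signal are obtained. A weighted estimate of the two yields a speech feature parameter, instantaneous frequency estimation (IFE), which can describe the fine frequency structure of speech. This feature is applied to whispered speaker identification and compared with the traditional Mel-frequency cepstral coefficients (MFCC). Experimental results show that, as the number of test speakers increases and the channel changes, the new feature achieves a better recognition rate and robustness.
     3. To mitigate the sharp performance drop that occurs when a speaker system trained on normal speech is tested with whispered speech, the paper treats whispered speech and normal speech as two different channels and, on the basis of a universal background model, applies feature mapping to the speech parameters before training and recognition so as to reduce the channel effect. Experimental results show that the recognition rate improves once feature mapping is added, and that, compared with traditional MFCC parameters, IFE parameters improve both recognition rate and robustness.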
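     As an illustration of items 1 and 2, the following is a minimal Python sketch, not the author's implementation. It computes the discrete Teager energy operator, applies the DESA-2 variant of the energy separation algorithm to recover instantaneous amplitude and frequency, and then forms a squared-amplitude-weighted mean frequency over a frame. The choice of DESA-2, the squared-amplitude weighting, and the function names `teager`, `desa2`, and `weighted_frequency` are all assumptions, since the abstract does not specify them. The band-pass filterbank that multiband demodulation would apply before this step is omitted, and DESA-2 is only valid for frequencies below a quarter of the sampling rate.

```python
import numpy as np

def teager(x):
    """Discrete Teager energy psi[n] = x[n]^2 - x[n-1]*x[n+1] (interior samples)."""
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]

def desa2(x, eps=1e-12):
    """DESA-2 estimates of instantaneous amplitude and frequency (rad/sample).

    Returns two arrays of length len(x) - 4, aligned to samples 2 .. len(x)-3.
    """
    x = np.asarray(x, dtype=float)
    psi_x = teager(x)[1:-1]        # energy of x, trimmed to align with psi_y
    y = x[2:] - x[:-2]             # symmetric difference x[n+1] - x[n-1]
    psi_y = teager(y)              # energy of the symmetric difference
    ratio = 1.0 - psi_y / (2.0 * np.maximum(psi_x, eps))
    omega = 0.5 * np.arccos(np.clip(ratio, -1.0, 1.0))
    amp = np.abs(2.0 * psi_x / np.sqrt(np.maximum(psi_y, eps)))
    return amp, omega

def weighted_frequency(amp, omega, fs):
    """Squared-amplitude-weighted mean frequency of one frame, in Hz (assumed weighting)."""
    w = amp ** 2
    return fs / (2.0 * np.pi) * np.sum(w * omega) / np.sum(w)

# Usage on a synthetic 500 Hz tone sampled at 8 kHz (hypothetical test signal).
fs = 8000.0
n = np.arange(400)
x = 0.7 * np.cos(2.0 * np.pi * 500.0 * n / fs)
amp, omega = desa2(x)
print(weighted_frequency(amp, omega, fs))  # close to 500 for this pure tone
```

     For a pure sinusoid the estimates are exact up to rounding. On real speech each band-pass channel would be demodulated separately, and the per-frame weighted frequencies of all channels would form the feature vector.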
Whispered speech, a special phonation mode that differs from normal speech both phonetically and physiologically, has long been part of daily life. With continuing economic and technological progress, whispered speech has come to play an increasingly important role and is widely used in settings such as financial services, public security, and identity authentication.
     In practical speaker identification, whispered speech can be considered a supplement to normal speech that improves the performance of a speaker identification system. Because of its inherent characteristics, whispered speech is harder to recognize than normal speech and is vulnerable to channel interference, and traditional speech parameters are less robust when applied to it. It is therefore necessary to develop an effective feature representation of whispered speech for speaker identification. In addition, when a speaker identification system trained mainly on normal speech is tested with whispered speech, its performance declines sharply. How to improve speaker identification accuracy when whispered speech data are sparse is therefore a problem worth studying. The contributions of this paper to whispered-speech speaker identification are as follows.
     1. Based on the nonlinear phenomena in speech and the formant modulation theory of speech production, this paper introduces the AM-FM model of speech production. The Teager energy operator and the discrete energy separation algorithm (DESA) are discussed in detail in the context of speech applications, and the energy separation algorithm is compared with other algorithms that serve a similar function.
     2. Following multiband demodulation analysis (MDA) for the detection of multicomponent AM-FM signals, the instantaneous amplitude and frequency of the speech signal are extracted with DESA. A speech parameter called instantaneous frequency estimation (IFE) is then obtained as a weighted estimate over amplitude and frequency; it represents the fine frequency structure of speech. The proposed parameters are applied to whispered speaker identification and compared with conventional MFCC. The experimental results show that, as the number of test speakers increases, the IFE parameters perform as well as MFCC, or slightly better. When the test channel changes, IFE makes the system more robust than MFCC does.
     3. The performance of a speaker identification system trained mainly on neutral (normal) speech declines sharply when it is tested with whispered speech. To address this, whispered speech and normal speech are assumed to come from different channels, and feature mapping based on the universal background model (UBM) is applied to reduce channel effects before the speaker system is trained and tested. The experimental results show that feature mapping improves the accuracy of the system and that IFE provides better robustness and accuracy than MFCC.
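     The feature-mapping step in item 3 can be illustrated with a minimal Python sketch, assuming the commonly used top-1 Gaussian formulation: a channel-dependent GMM is adapted from a channel-independent root model so that their components correspond, and each frame is shifted and scaled from its best-matching channel-dependent component to the matching root component. The abstract does not give the thesis's exact procedure, so the formulation, the diagonal-covariance assumption, the function names, and the toy parameters below are all hypothetical.

```python
import numpy as np

def log_weighted_gauss(x, weights, means, stds):
    """Per-component log(w_j * N(x; mu_j, diag(sigma_j^2))) for one frame x."""
    z = (x - means) / stds
    log_pdf = (-0.5 * np.sum(z ** 2, axis=1)
               - np.sum(np.log(stds), axis=1)
               - 0.5 * x.size * np.log(2.0 * np.pi))
    return np.log(weights) + log_pdf

def feature_map(x, cd, ci):
    """Map frame x from the channel-dependent model cd into the space of the
    channel-independent model ci. Each model is a (weights, means, stds) tuple
    with corresponding component order."""
    i = int(np.argmax(log_weighted_gauss(x, *cd)))   # top-1 component in cd
    _, mu_cd, sd_cd = cd
    _, mu_ci, sd_ci = ci
    return (x - mu_cd[i]) * sd_ci[i] / sd_cd[i] + mu_ci[i]

# Toy two-component, two-dimensional models (hypothetical values).
ci_model = (np.array([0.5, 0.5]),
            np.array([[0.0, 0.0], [4.0, 4.0]]),
            np.ones((2, 2)))
whisper_model = (np.array([0.5, 0.5]),
                 np.array([[1.0, -1.0], [6.0, 3.0]]),
                 np.array([[2.0, 2.0], [0.5, 0.5]]))

print(feature_map(np.array([7.0, 3.5]), whisper_model, ci_model))  # [6. 5.]
```

     In this picture, whispered and normal frames would each be mapped through their own channel-dependent model into the shared space before speaker models are trained or scored.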