Research on Speech Emotion Recognition Based on HHT
Abstract
Speech emotion recognition requires extracting emotion-related feature parameters from speech samples and applying a pattern recognition method to identify the emotion an utterance conveys. It is an emerging research direction in speech signal processing with broad application prospects. The key problem in speech emotion recognition is how to extract features that effectively reflect emotional information, since this directly determines the recognition result.
     In this thesis, the Hilbert-Huang Transform (HHT) is applied to process emotional speech and analyze its characteristics as a whole. Feature parameters are extracted on this basis, and text-independent, speaker-independent speech emotion recognition is carried out with satisfactory results. The main contributions are as follows:
     The principle of HHT is discussed in detail, revealing its essential characteristics and its advantages for signal processing. On this basis, the concept of marginal energy is proposed and used, together with the marginal spectrum, to analyze emotional speech. A statistical analysis of four emotional states (happiness, anger, boredom, and neutrality) shows that marginal energy and the marginal spectrum reflect the energy distribution of emotional speech in the time and frequency domains respectively, and can reveal regularities intrinsic to each emotion. They are therefore adopted as the basis for emotion recognition: a time-domain feature, the Hilbert energy statistic (EHHT), is extracted from the marginal energy, and four frequency-domain features are extracted from the marginal spectrum: sub-band energy (SE), its first-order difference (DSE), sub-band energy cepstral coefficients (SECC), and their first-order difference (DSECC). Finally, vector quantization (VQ) is used to perform speaker-independent, text-independent speech emotion recognition with each of these features. The results show that time-domain or frequency-domain features alone cannot recognize speech emotion effectively, but combining the two raises the recognition rate to as high as 98.53%, with little fluctuation as the codebook size varies, giving relatively stable performance.
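The two marginal quantities introduced above can be illustrated with a minimal sketch. This is not the thesis's implementation: the EMD sifting step is omitted (pre-computed IMFs are passed in directly), and the uniform frequency-binning scheme is an assumption. The sketch shows only how a marginal spectrum (energy vs. frequency, integrated over time) and a marginal energy (energy vs. time, integrated over frequency) fall out of per-IMF Hilbert analysis.

```python
import numpy as np
from scipy.signal import hilbert

def hilbert_marginals(imfs, fs, nbins=64):
    """Marginal spectrum h(f) and marginal energy E(t) from a set of IMFs.

    imfs : array (n_imfs, n_samples), EMD output (supplied directly here)
    Returns (bin_centers, h, E) where
      h[k] = energy accumulated in frequency bin k (marginal spectrum),
      E[t] = sum over IMFs of squared instantaneous amplitude (marginal energy).
    """
    n_imfs, n = imfs.shape
    edges = np.linspace(0.0, fs / 2.0, nbins + 1)
    h = np.zeros(nbins)
    E = np.zeros(n - 1)  # the phase difference loses one sample
    for imf in imfs:
        analytic = hilbert(imf)
        amp2 = np.abs(analytic)[:-1] ** 2            # squared instantaneous amplitude
        phase = np.unwrap(np.angle(analytic))
        freq = np.diff(phase) * fs / (2.0 * np.pi)   # instantaneous frequency (Hz)
        hist, _ = np.histogram(freq, bins=edges, weights=amp2)
        h += hist   # integrate the Hilbert spectrum over time -> marginal spectrum
        E += amp2   # integrate over frequency -> marginal (Hilbert) energy
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers, h, E
```

As a sanity check, feeding in two stand-in "IMFs" (a 1200 Hz tone and a weaker 200 Hz tone) concentrates the marginal spectrum around those frequencies, while the marginal energy stays near the summed squared amplitudes across time.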
     By applying HHT to emotional speech processing and combining time- and frequency-domain features for recognition, this thesis not only improves the recognition rate but also greatly reduces the codebook size, which is of practical significance.
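A rough sketch of the frequency-domain features named in the abstract (SE, DSE, SECC, DSECC). The thesis only names the features, so the band layout, the frame-wise arrangement, and the number of cepstral coefficients are all assumptions here; each row of the input is taken to be a per-frame marginal spectrum, with first-order differences computed across frames.

```python
import numpy as np
from scipy.fft import dct

def secc_features(marginal_frames, n_bands=16, n_ceps=12):
    """Sub-band energy (SE), sub-band energy cepstral coefficients (SECC),
    and their first-order differences (DSE, DSECC).

    marginal_frames : array (n_frames, n_freq_bins) of per-frame marginal spectra
    """
    n_frames, n_bins = marginal_frames.shape
    bands = np.array_split(np.arange(n_bins), n_bands)   # assumed uniform band split
    # SE: total energy in each sub-band, per frame
    se = np.stack([marginal_frames[:, idx].sum(axis=1) for idx in bands], axis=1)
    log_se = np.log(se + 1e-12)                          # floor avoids log(0)
    # SECC: DCT of the log sub-band energies, truncated (cepstral-style)
    secc = dct(log_se, type=2, norm='ortho', axis=1)[:, :n_ceps]
    # First-order differences across frames (first frame's delta is zero)
    dse = np.diff(se, axis=0, prepend=se[:1])
    dsecc = np.diff(secc, axis=0, prepend=secc[:1])
    return se, dse, secc, dsecc
```

The log-then-DCT step mirrors how MFCCs are derived from filter-bank energies, which is presumably why the abstract calls these "cepstral" coefficients.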
Speech emotion recognition requires extracting emotional features from speech signals and applying a pattern recognition method to determine which emotion the speech conveys. It is a new area of speech processing with wide applications. Feature extraction, which directly determines the recognition results, is the most important factor.
     In this thesis, the Hilbert-Huang Transform (HHT) is applied to emotional speech processing and analysis. HHT-based features are extracted, and text-independent, speaker-independent emotion recognition is simulated. The details are as follows:
     Firstly, the theory of HHT is discussed, and its essence and merits in signal processing are shown. On this basis, marginal energy is proposed and used in emotional speech analysis together with the marginal spectrum. Statistical analysis of four emotions (happy, angry, bored, and neutral) demonstrates that marginal energy and the marginal spectrum reflect the energy distribution characteristics in the time and frequency domains respectively; thus they can serve as a basis for emotion recognition. The statistical Hilbert energy (EHHT) is then extracted from the marginal energy, while sub-band energy (SE), its first-order difference (DSE), sub-band energy cepstral coefficients (SECC), and their first-order difference (DSECC) are extracted from the marginal spectrum. Finally, using vector quantization (VQ), speaker-independent, text-independent emotion recognition is simulated with each of the above features. The results demonstrate that time-domain or frequency-domain features alone cannot recognize speech emotion effectively, but their combination achieves a recognition rate of up to 98.53%.
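The VQ decision rule can be sketched as follows. Training one codebook per emotion with k-means and labeling an utterance by the codebook with the smallest average quantization distortion is a common VQ-recognizer layout; the abstract specifies only that VQ is used, so this particular training and scoring scheme is an assumption.

```python
import numpy as np
from scipy.cluster.vq import kmeans2, vq

def train_codebooks(features_by_emotion, codebook_size=16):
    """Train one VQ codebook per emotion by k-means over that emotion's
    training feature vectors.

    features_by_emotion : dict mapping emotion label -> array (n_vectors, dim)
    """
    books = {}
    for emo, feats in features_by_emotion.items():
        books[emo], _ = kmeans2(feats.astype(float), codebook_size, minit='++')
    return books

def classify(feats, books):
    """Label an utterance (a set of feature vectors) with the emotion whose
    codebook yields the smallest average quantization distortion."""
    def distortion(book):
        _, dists = vq(feats.astype(float), book)  # distance to nearest codeword
        return dists.mean()
    return min(books, key=lambda emo: distortion(books[emo]))
```

With well-separated feature distributions the minimum-distortion rule recovers the correct class; the abstract's observation that the recognition rate varies little with codebook size would correspond here to the choice of `codebook_size`.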
     In this thesis, HHT is applied to speech processing and emotion recognition. The use of HHT time-frequency features not only improves the recognition rate but also reduces the codebook size; the research is therefore both meaningful and feasible.