Research on Emotion Recognition Based on Speech Signals
Abstract
Speech is an important means of human communication and the most convenient, basic, and direct way for people to exchange information. Besides semantic content, speech signals also convey emotional information, and emotion plays an important role in people's lives and interactions. With the rapid development of human-computer interaction technology, the emotional information carried in speech signals is therefore attracting growing attention from researchers. As an important direction in the processing of emotional information in speech, speech emotion recognition is the key to enabling computers to understand human emotion and a prerequisite for intelligent human-computer interaction. However, research on human emotion is still at an exploratory stage, and there is as yet no unified understanding of how emotion should be defined and represented. Moreover, emotion has strong social and cultural characteristics, and speech signals are themselves complex; together these factors confront speech emotion recognition with considerable difficulties. Research on speech emotion recognition is still in its early days, and many topics, including emotional speech corpora, emotional features, and emotion modeling and recognition methods, await deeper study.

Aiming at a speech emotion recognition system that is independent of speaker and text content, this thesis investigates emotional speech databases, the extraction of acoustic feature parameters from speech, the analysis and selection of emotional features, the emotion dimension space, and emotion modeling and recognition. Based on the analysis of a large body of emotional speech data, two speech emotion modeling methods are proposed, providing a theoretical and technical framework for speech emotion recognition and laying a foundation for natural human-computer interaction. Using these two emotion models, two speech emotion recognition algorithms are developed and a speaker- and text-independent Mandarin speech emotion recognition system is constructed.

The innovations and main contributions of this thesis are as follows:

(1) Motivated by the needs of emotional feature extraction from speech, a fundamental frequency (F0) estimation algorithm based on the modified cepstrum and dynamic programming is proposed. Exploiting the different behavior of the cepstrum, short-time energy, and short-time zero-crossing rate in voiced and unvoiced segments, the algorithm constructs a voiced/unvoiced decision function that simplifies the voicing decision and greatly improves its accuracy. To obtain realistic F0 estimates with smooth trajectories, dynamic programming is used for pitch tracking. Because the continuity of F0 is fully taken into account, the algorithm effectively avoids pitch doubling and halving errors, achieving high accuracy and smooth F0 contours.

(2) The relationships between emotional states and acoustic features such as prosody and vocal tract formants are analyzed qualitatively and quantitatively, yielding several conclusions of practical importance. The analysis shows that although short-time energy helps distinguish emotional states to some extent, it has clear limitations; the distribution of signal energy across frequency bands, by contrast, is highly informative, and in particular the proportion of energy below 250 Hz relative to the total energy is an important feature for distinguishing emotional states. The relationships between emotional states and features such as the pitch contour and the derivative of the pitch trajectory are also analyzed. In the course of this analysis we found considerable differences between male and female speakers in the distributions of emotional speech feature parameters. Accordingly, a gender classification method is proposed that takes the mean, range, and variance of F0 as features and applies a Fisher linear discriminant function. Experimental results show that, after training, this method achieves a very high classification rate.

(3) A three-dimensional emotion space model is proposed. The positions of several basic emotions in this space are determined through listening experiments, and the correlations between the prosodic and voice quality features of speech and the different emotion dimensions are analyzed quantitatively.

(4) From the perspective of emotion modeling, and in view of the dual continuous and discrete nature of emotion, the concept of the data field is introduced into emotion modeling: the notions of the emotion field and emotional potential are proposed, and an improved method for computing the potential function is given. The positions of the basic emotion centers in the emotion space are located by optimizing the potential function, so that the emotion at any point in the space can be viewed as a composite of several basic emotions. The contribution of each basic emotion to a given point is determined by the emotional potential that its center exerts at that point, and the magnitude of this potential gives the degree to which the emotion at that point belongs to that basic emotion. Based on this idea, a Mandarin speech emotion recognition method using the emotion field is developed, achieving a recognition rate superior to traditional speech emotion recognition methods.

(5) Based on the correlations between prosodic features and emotional arousal and between voice quality features and valence, an emotion modeling method grounded in emotion dimensions is proposed. For each emotion, prosodic features and voice quality features are used to build probabilistic models of arousal and valence respectively; the probability outputs of each emotional speech sample on the twelve dimension models are then used as features to train emotion category models. Gaussian mixture models (GMMs) are used to build the dimension models, and an initialization method for the GMM parameters based on cluster analysis of the training samples is proposed. For the final recognition step, support vector machines (SVMs) are used to construct a six-class emotion recognizer. Experiments on Mandarin speech emotion recognition with this dimension model achieve a recognition rate superior to the emotion field method.

As a new attempt, the two speech emotion modeling methods proposed in this thesis have a sound theoretical basis and good practical performance, laying a solid foundation for future research on speech emotion modeling and recognition.
Speech is one of the most convenient means of communication between people, conveying emotion as well as semantic information, and emotion plays an important role in communication. Emotional information processing in speech signals has therefore gained increasing attention in recent years as the need for machines to understand humans well in human-machine interaction has grown. As one of the most important branches of emotional information processing in speech, emotion recognition from speech is fundamental to natural human-machine communication. However, research on human emotion is still at an exploratory stage, and there is no generally accepted definition of human emotion. Emotion also has strong social and cultural characteristics, and speech signals themselves carry complex information. All of these factors pose great challenges for emotion recognition from human speech, which is still in its infancy.
In order to establish a speaker-independent speech emotion recognition system that does not rely on context or linguistic information, this thesis focuses on the construction of an emotional speech corpus, the extraction of acoustic features from speech, the analysis and selection of emotional features, the emotion dimension space, emotion modeling, and emotion recognition. Based on the analysis of a large number of emotional speech samples, two emotion modeling methods are presented, providing a theoretical and technical framework for emotion recognition in spoken language. Building on these studies, two emotion recognition algorithms are implemented and a speaker- and content-independent Mandarin emotion recognition system is completed.
The innovative points and main contributions of this thesis are as follows:
(1) An algorithm based on the modified cepstrum is presented for estimating the fundamental frequency (F0) of speech signals. Voicing decisions are made using a decision function composed of the cepstral peak, zero-crossing rate, and energy of short-time speech segments, yielding accurate voiced/unvoiced classification. A dynamic programming method is then used for pitch tracking, with the continuity of F0 fully accounted for in the cost function. The proposed algorithm effectively avoids pitch doubling and halving errors while preserving legitimate doubling and halving of F0, and it produces an accurate, smooth F0 contour that needs no further smoothing.
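For illustration, a minimal sketch of such a pipeline is given below. It assumes NumPy, pre-cut short-time frames (e.g. 32 ms at 16 kHz), and placeholder thresholds and weights rather than the values used in the thesis: cepstral peaks in the plausible pitch range serve as F0 candidates, a voicing decision combines cepstral peak height, short-time energy, and zero-crossing rate, and a Viterbi-style dynamic program picks one candidate per frame while penalizing log-F0 jumps, which is what suppresses octave errors.

```python
import numpy as np

def f0_candidates(frame, fs, f0_min=60.0, f0_max=400.0, n_cand=3):
    """Top cepstral peaks in the plausible pitch range, as (f0, salience) pairs."""
    spec = np.fft.rfft(frame * np.hamming(len(frame)))
    cep = np.fft.irfft(np.log(np.abs(spec) + 1e-12))
    lo, hi = int(fs / f0_max), int(fs / f0_min)
    idx = lo + np.argsort(cep[lo:hi])[-n_cand:]
    return [(fs / q, cep[q]) for q in idx]

def is_voiced(frame, cep_salience, e_thr=1e-4, z_thr=0.25, c_thr=0.05):
    """Illustrative V/UV decision: voiced frames combine high short-time
    energy, a low zero-crossing rate, and a prominent cepstral peak."""
    energy = np.mean(frame ** 2)
    zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0
    return energy > e_thr and zcr < z_thr and cep_salience > c_thr

def track_f0(frames, fs, jump_weight=2.0):
    """Viterbi search over per-frame F0 candidates; the transition cost grows
    with the log-F0 jump, so pitch doubling/halving paths are penalized."""
    cands = [f0_candidates(f, fs) for f in frames]
    cost = np.array([-s for _, s in cands[0]])  # local cost = -salience
    back = []
    for t in range(1, len(frames)):
        new_cost, ptr = [], []
        for f0_t, sal in cands[t]:
            trans = [cost[j] + jump_weight * abs(np.log(f0_t / cands[t - 1][j][0]))
                     for j in range(len(cands[t - 1]))]
            j_best = int(np.argmin(trans))
            new_cost.append(trans[j_best] - sal)
            ptr.append(j_best)
        cost = np.array(new_cost)
        back.append(ptr)
    # Backtrack the cheapest path, then zero out unvoiced frames.
    path = [int(np.argmin(cost))]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    f0 = np.array([cands[t][path[t]][0] for t in range(len(frames))])
    voiced = np.array([is_voiced(frames[t], max(s for _, s in cands[t]))
                       for t in range(len(frames))])
    return np.where(voiced, f0, 0.0)
```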
(2) This thesis analyzes the relationships between emotional states and acoustic features of speech, including prosody and voice quality. The limitations of short-time energy for distinguishing emotional states are pointed out. On the other hand, we find that the proportion of energy below 250 Hz relative to the total energy is a promising feature for emotion recognition from speech. The characteristics of the pitch contour and the pitch derivative are also analyzed for the purpose of emotion recognition. At the same time, differences in emotional acoustic features between male and female speech are identified, and a gender classification method is developed based on these findings: the mean, range, and variance of F0 serve as features, and a Fisher linear discriminant function distinguishes male from female speech. Experimental results show that the proposed method achieves high accuracy.
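The two feature ideas can be made concrete with the following hedged NumPy sketch; the function names and the midpoint decision threshold are illustrative choices, not taken from the thesis. The first function computes the share of spectral energy below 250 Hz; the other two train and apply a Fisher linear discriminant over the (F0 mean, F0 range, F0 variance) feature vector.

```python
import numpy as np

def low_band_energy_ratio(signal, fs, cutoff=250.0):
    """Fraction of total spectral energy below `cutoff` Hz."""
    power = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    return power[freqs < cutoff].sum() / (power.sum() + 1e-12)

def fit_fisher(X_male, X_female):
    """Fisher LDA: w = Sw^{-1}(mu_m - mu_f); the decision threshold is the
    midpoint of the projected class means (an illustrative choice)."""
    mu_m, mu_f = X_male.mean(axis=0), X_female.mean(axis=0)
    Sw = (np.cov(X_male, rowvar=False) * (len(X_male) - 1)
          + np.cov(X_female, rowvar=False) * (len(X_female) - 1))
    w = np.linalg.solve(Sw, mu_m - mu_f)
    thr = 0.5 * ((X_male @ w).mean() + (X_female @ w).mean())
    return w, thr

def predict_male(f0_track, w, thr):
    """Features: mean, range, and variance of the voiced F0 values
    (zeros in the track mark unvoiced frames)."""
    f0 = f0_track[f0_track > 0]
    x = np.array([f0.mean(), f0.max() - f0.min(), f0.var()])
    return float(x @ w) > thr
```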
(3) A concept for an emotion space model based on results from psychological research is presented, and a perceptual experiment is reported. In the experiment, we studied where the six basic emotions of Mandarin are located in the emotion space. Furthermore, we studied the relationships between prosodic and voice quality features and the mean ratings in the two-dimensional space of arousal and valence.
(4) From the point of view of emotion modeling, this thesis uses the emotion field and emotional potential to describe the emotion space, introducing the concepts of the data field and the potential function into emotion modeling. Under this model, any emotion in the emotion space can be seen as a composite of all the basic emotions considered in this research; the contribution of each basic emotion is determined by the emotional potential that its center exerts at the point in question. The center of each basic emotion is located by a hill-climbing algorithm. The emotion recognition algorithm based on this model performs better than traditional methods.
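As an illustration only, the sketch below realizes the emotion field idea with a Gaussian-style potential of the kind common in data-field theory; the exact potential function, the dimensionality of the space, and the basic-emotion centers shown are assumptions (the thesis locates the centers by hill climbing over the potential function).

```python
import numpy as np

def emotional_potential(x, center, sigma=1.0):
    """Potential exerted at point x by one basic-emotion center
    (Gaussian form assumed here; sigma is a placeholder)."""
    return np.exp(-np.sum((x - center) ** 2) / (2.0 * sigma ** 2))

def classify_by_field(x, centers):
    """Treat the emotion at x as a composite of the basic emotions: each
    center contributes in proportion to its potential at x, and the
    strongest normalized potential gives the recognized category."""
    pot = {name: emotional_potential(x, c) for name, c in centers.items()}
    total = sum(pot.values())
    degrees = {name: p / total for name, p in pot.items()}
    return max(degrees, key=degrees.get), degrees

# Hypothetical 3-D centers, for illustration only.
centers = {"anger":   np.array([0.8,  0.7,  0.5]),
           "joy":     np.array([0.7, -0.6,  0.3]),
           "sadness": np.array([-0.6, -0.4, -0.5])}
label, degrees = classify_by_field(np.array([0.6, 0.5, 0.4]), centers)
```

The normalized potentials double as soft membership degrees, which matches the view of emotion as both continuous and discrete: a hard label falls out of the argmax, while the full degree vector preserves the mixture.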
(5) A dimension-based emotion model is presented according to the relationships between acoustic features of speech and emotion dimensions. In this modeling method, prosodic features are used to construct statistical arousal models and voice quality features are used to construct statistical valence models. The probability outputs of all these dimension models are then used as features to build emotion category models. GMMs are chosen to construct the emotion dimension models, and a new clustering-based algorithm for estimating the initial GMM parameters is proposed. An SVM is used to build the emotion category models. Experimental results indicate that the emotion recognition algorithm based on this model performs better than the emotion field method.
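A hedged sketch of this two-stage pipeline follows, assuming scikit-learn and NumPy. Plain k-means stands in for the thesis's clustering-based GMM initialization, per-sample GMM log-likelihoods stand in for the "probability outputs" (six arousal models over prosodic features plus six valence models over quality features give a 12-dimensional vector), and the six emotion labels listed are placeholders, since the thesis does not name them here.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

EMOTIONS = ["anger", "joy", "sadness", "fear", "surprise", "neutral"]  # assumed set

def fit_dimension_gmm(X, n_components=4):
    """A GMM whose means start from k-means centroids; plain k-means stands
    in here for the thesis's clustering-based initialization."""
    km = KMeans(n_clusters=n_components, n_init=10).fit(X)
    return GaussianMixture(n_components=n_components,
                           means_init=km.cluster_centers_).fit(X)

def dimension_scores(p_vec, q_vec, arousal, valence):
    """12-dim feature: per-emotion log-likelihoods under the six arousal
    models (prosodic input) and six valence models (quality input)."""
    p, q = p_vec.reshape(1, -1), q_vec.reshape(1, -1)
    return np.array([arousal[e].score(p) for e in EMOTIONS] +
                    [valence[e].score(q) for e in EMOTIONS])

def train_recognizer(P, Q, y):
    """P: (n, dp) prosodic features; Q: (n, dq) quality features;
    y: array of emotion labels, one per sample."""
    arousal = {e: fit_dimension_gmm(P[y == e]) for e in EMOTIONS}
    valence = {e: fit_dimension_gmm(Q[y == e]) for e in EMOTIONS}
    feats = np.vstack([dimension_scores(P[i], Q[i], arousal, valence)
                       for i in range(len(y))])
    return arousal, valence, SVC(kernel="rbf").fit(feats, y)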
The two emotion modeling methods proposed in this thesis, which have sound scientific foundations and good performance, provide a direction for future work on emotion recognition in spoken language.
