Research and Application of Speech Emotion Recognition
Abstract
With the development of human-computer interaction technology, research on human-computer interfaces has gradually moved from the era of mechanization into the era of multimedia user interfaces. As one of the key technologies of intelligent human-computer interaction, speech emotion analysis and recognition has become a research hotspot. Researchers in many fields are keenly interested in how to automatically recognize a speaker's emotional state from speech, so that computers can respond in a more targeted and more human way.
     This thesis first outlines the significance of research on speech emotion recognition and the main topics covered in the work, and then reviews several key issues in current speech emotion research: the classification of emotional states, an overview of emotional speech corpora, acoustic features of speech signals, feature dimensionality reduction, classification algorithms, and speech emotion classification based on semi-supervised learning.
     The thesis proposes several models for feature selection and feature extraction. Speech emotion recognition based on the fusion of all-class and pairwise-class feature selection is a new model structure: it attends to the discriminability of every pair of emotion classes while also taking the global distribution of the sample data into account, and therefore combines the all-class and pairwise-class feature selection schemes. The structure is applicable to a variety of classification algorithms and effectively improves recognition performance. A feature selection algorithm based on the feature projection matrix uses the projection matrix produced by a feature extraction algorithm to measure the importance of each initial acoustic feature, and selects a feature subset accordingly; experiments show that this algorithm outperforms feature extraction that simply applies the projection matrix as a mapping. Speech emotion recognition based on multi-level feature extraction analyzes the data and applies different dimensionality reduction algorithms to utterances of different genders and different emotion classes; this idea can be carried over to other corpora by building a suitable multi-level dimensionality reduction recognition system, improving overall recognition performance. Finally, the enhanced Lipschitz embedding algorithm based on manifold learning is a nonlinear dimensionality reduction algorithm that maps high-dimensional feature vectors into a low-dimensional subspace via geodesic distances; it significantly improves recognition accuracy in speaker-dependent and speaker-independent speech emotion recognition under controlled laboratory conditions, and in speaker-dependent recognition under Gaussian white noise and sinusoidal noise.
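     To make the projection-matrix idea concrete, here is a minimal Python sketch. It is not the thesis's exact formulation: it assumes scikit-learn's LinearDiscriminantAnalysis as the feature-extraction step and a generic acoustic feature matrix X with emotion labels y, and it ranks the initial features by the magnitude of their weights in the learned projection matrix instead of mapping the data through it.

     import numpy as np
     from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

     def select_by_projection_matrix(X, y, k):
         """Keep the k acoustic features with the largest total weight in
         the projection matrix, rather than projecting the data."""
         lda = LinearDiscriminantAnalysis().fit(X, y)
         # scalings_ has one row per original feature and one column per
         # projected dimension; row magnitudes measure feature importance.
         importance = np.abs(lda.scalings_).sum(axis=1)
         top_k = np.argsort(importance)[::-1][:k]
         return X[:, top_k], top_k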
     In traditional speech emotion recognition systems, the individual acoustic features are simply concatenated as components of a feature vector that serves as the classifier input. A recognition system based on covariance descriptors and the Riemannian manifold instead considers the correlations between different acoustic features. Experiments show that these correlations reflect the emotional information in speech, and that a recognition system built on them is highly stable and robust to noise.
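     A minimal sketch of the covariance-descriptor idea, assuming frame-level acoustic features stored as an (n_frames, n_features) NumPy array and using the log-Euclidean metric as a simple stand-in for the thesis's Riemannian-manifold machinery:

     import numpy as np
     from scipy.linalg import logm

     def covariance_descriptor(frames, eps=1e-6):
         """Covariance of frame-level features over one utterance; the
         off-diagonal entries capture the correlations between acoustic
         features. A small ridge keeps the matrix positive definite."""
         C = np.cov(frames, rowvar=False)
         return C + eps * np.eye(C.shape[0])

     def log_euclidean_distance(C1, C2):
         # Distance between two SPD descriptors: Frobenius norm of the
         # difference of their matrix logarithms.
         return np.linalg.norm(logm(C1) - logm(C2), ord='fro')

     A nearest-neighbor classifier over such distances is one straightforward way to use the descriptor.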
     For the case of only a few labeled samples and a large number of unlabeled samples, the thesis proposes an enhanced co-training algorithm that builds a classification model based on semi-supervised learning. By introducing a consistency restriction on the class predictions, it improves the standard co-training algorithm, reducing the classification noise that co-training introduces and improving classifier performance.
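     A minimal sketch of co-training with a prediction-consistency restriction (a hypothetical simplification of the enhanced algorithm, assuming two feature views and scikit-learn classifiers that implement predict_proba):

     import numpy as np
     from sklearn.base import clone

     def cotrain_consistent(clf_a, clf_b, Xa, Xb, y, Ua, Ub,
                            rounds=10, per_round=20):
         """Xa/Xb are two views of the labeled data (labels y); Ua/Ub are
         the same views of the unlabeled pool. Each round promotes only
         samples that both view classifiers label identically and with
         high confidence."""
         a, b = clone(clf_a).fit(Xa, y), clone(clf_b).fit(Xb, y)
         for _ in range(rounds):
             if len(Ua) == 0:
                 break
             pa, pb = a.predict(Ua), b.predict(Ub)
             conf = (a.predict_proba(Ua).max(axis=1)
                     * b.predict_proba(Ub).max(axis=1))
             agree = np.flatnonzero(pa == pb)      # consistency restriction
             picked = agree[np.argsort(conf[agree])[::-1][:per_round]]
             if len(picked) == 0:
                 break
             Xa, Xb = np.vstack([Xa, Ua[picked]]), np.vstack([Xb, Ub[picked]])
             y = np.concatenate([y, pa[picked]])
             keep = np.setdiff1d(np.arange(len(Ua)), picked)
             Ua, Ub = Ua[keep], Ub[keep]
             a, b = clone(clf_a).fit(Xa, y), clone(clf_b).fit(Xb, y)
         return a, b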
     Considering the practical use of speech emotion research, an AdaBoost+C4.5 classification model is used to analyze the emotional content of speech signals, achieving fully real-time emotion recognition, which is then applied in a real-time emotional-speech-driven facial animation system.
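     A minimal sketch of the AdaBoost+C4.5 model: scikit-learn's entropy-splitting DecisionTreeClassifier stands in for C4.5, which has no canonical Python implementation; the estimator parameter name assumes scikit-learn >= 1.2, and feature extraction from the audio stream is assumed to happen upstream.

     from sklearn.ensemble import AdaBoostClassifier
     from sklearn.tree import DecisionTreeClassifier

     def build_emotion_classifier(n_estimators=100):
         # Boosted entropy-based trees approximate AdaBoost + C4.5; the
         # resulting ensemble is cheap enough to score utterance-level
         # feature vectors in real time, e.g. to drive facial animation.
         c45_like = DecisionTreeClassifier(criterion="entropy", max_depth=8)
         return AdaBoostClassifier(estimator=c45_like,
                                   n_estimators=n_estimators)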