Research on Speech-Driven Face Modeling and Animation Technology
Abstract
Speech-driven face modeling and animation technology first builds a 3D face model suited to animation from face information, and then produces the corresponding mouth shapes on the model's lips from a given speech signal, thereby deepening the listener's understanding of the spoken content. By making applications such as 3D game production, film dubbing, media content editing, assisted teaching, and visual communication simpler and more practical, the technology is of significant theoretical and practical value.
     Taking face modeling and speech-driven animation as its main line of research, this dissertation first proposes a method for building a 3D face model; on top of that model, it then controls lip animation by extracting lip-motion parameters and constructing a realistic lip-animation model; finally, it analyzes the input speech and extracts speech features that drive the lip motion and generate the corresponding mouth shapes. The specific research work and contributions include:
     A multi-template ASM algorithm is proposed that first performs coarse localization with a global template and then precise localization with local templates. During local localization, a narrow strip is constructed around each template's feature points, the closed-form image-segmentation algorithm is applied to segment the texture within the strip, and the local templates are then matched against the image to obtain the positions of the facial feature points. Experiments show that the improved algorithm markedly alleviates the traditional ASM algorithm's inaccurate localization of feature points in texture-smooth regions, raises feature-point extraction accuracy, and thereby improves the accuracy of the resulting 3D face model.
     The traditional Mean-Shift algorithm is improved for lip tracking and detection. By introducing a target-boundary-region likelihood and a Level Set model, the improved algorithm adjusts the tracking-window size in real time and captures the motion of both the inner and outer lips during speech. In the Level Set model, small regions are placed at the center of the tracking window, and lip detection combines the lip gradient information with the likelihood between these small regions and the lip boundary, which is more accurate than detection based on gradient information alone. Combining an ASM model with the boundary-region likelihood further improves outer-lip detection accuracy, providing reliable data for lip animation.
     A lip-animation method that fuses a muscle model with MPEG-4 is proposed. The method defines skin points, skeleton points, and muscle control regions on the Candide-3 face model: skeleton points constrain the motion of the lip feature points; non-feature points inside a muscle's control region are adjusted by the muscle model; non-feature points outside it are adjusted by lip animation definition tables. Loop subdivision and a simplified muscle model improve the smoothness and efficiency of the animation, and experiments confirm that the control method effectively increases the realism of the lip animation.
     A method for initial/final (shengmu/yunmu) segmentation is proposed that builds a loss function and exploits the quasi-periodicity of voiced sounds and the durations of initials. The method first computes the autocorrelation function of the speech, then builds the cost (loss) function and detects voiced segments from the results by dynamic programming, next determines the search range for initials from the statistical distribution of initial durations, and finally segments initials and finals within that range, around the onset of each voiced segment, using an auditory-event detection method. Experiments show that segmenting on top of the detected voiced segments reduces the influence of noise and Mandarin sound-change phenomena and raises segmentation accuracy, which in turn improves the accuracy of the mouth shapes generated by speech-driven animation.
     A dynamic viseme model for Chinese is proposed. Since Chinese is a syllabic language whose pronunciation follows a spindle-shaped ("date-pit") intensity pattern, the model describes lip motion within a syllable and between syllables separately. Within a syllable, an extended DTW algorithm matches the syllable against lip sub-motion models, so that each syllable is described by sub-motion models. Between syllables, a weighting function that grades vowel influence simulates coarticulation: the influence of each vowel on the mouth shape of the following consonant is analyzed first, and the weighting function then controls the actual mouth shape. Experiments show that, compared with describing Chinese dynamic visemes by initials/finals, by whole-syllable pronunciation processes, or by triphone-style visemes, the method improves the continuity and plausibility of speech-driven animation and is well suited to representing Chinese coarticulation.
Speech-driven face modeling and animation technology belongs to the field of visual speech synthesis. It first constructs a 3D model suited to animation from face information, and then produces the corresponding lip animation from a given speech signal, improving the audience's comprehension of the speech. Owing to the simplicity and practicality it brings to applications, the technology is of theoretical and practical importance to 3D game production, film dubbing, media content editing, assisted teaching, and visual communication.
     This dissertation takes face modeling and speech-driven animation technology as its principal subject. It first proposes a 3D face modeling approach; on top of that model, it controls lip animation by extracting lip-motion parameters and building a realistic lip-animation model; finally, it analyzes the input speech and extracts speech features to drive the corresponding lip animation. The main contents and contributions of this dissertation are summarized as follows:
     A refined multi-template ASM algorithm is proposed, which performs global localization and then local localization with their respective templates. In the local stage, a narrow strip is first constructed around the feature points of each template, the closed-form image-segmentation algorithm is then employed to segment the texture within the strip, and finally the local templates are matched against the image to obtain the feature-point positions. Experimental results show that the improved algorithm effectively addresses the traditional ASM algorithm's inaccurate localization of feature points in texture-smooth regions and raises the detection accuracy of all feature points.
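The local-template refinement step can be sketched as a 1D search along a narrow strip: slide a local gray-level template over samples taken around the coarse landmark estimate and keep the offset with the highest normalized correlation. This is a minimal illustration rather than the thesis's closed-form-segmentation pipeline; the name `refine_landmark_1d`, the strip sampling, and the correlation score are assumptions of this sketch.

```python
import numpy as np

def refine_landmark_1d(strip, template):
    """Slide a local appearance template along a narrow strip sampled
    around a coarse landmark estimate; return the offset (relative to
    the strip centre) with the highest normalized correlation."""
    t = (template - template.mean()) / (template.std() + 1e-9)
    half = len(template) // 2
    best_score, best_off = -np.inf, 0
    for i in range(half, len(strip) - half):
        w = strip[i - half:i + half + 1]
        w = (w - w.mean()) / (w.std() + 1e-9)  # zero-mean, unit-variance window
        score = float(np.dot(w, t))
        if score > best_score:
            best_score, best_off = score, i - len(strip) // 2
    return best_off
```

In the full algorithm this 1D search would run along the normal of each local-template feature point, on the texture-segmented strip rather than raw intensities.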
     The traditional Mean-Shift algorithm is improved to perform lip tracking and detection. By introducing a target-boundary-region likelihood and a Level Set model, the algorithm adjusts the size of the search window in real time and captures the motion of a speaker's lips. In the Level Set model, sub-regions are placed around the center of the search window, and the likelihood between the sub-regions and the lip boundary is used for lip detection; this yields more accurate contour extraction than a Level Set model that uses gradient information alone. Combining ASM with the target-boundary-region likelihood makes outer-lip extraction more robust and supplies reliable data for lip animation.
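The core Mean-Shift iteration underlying the tracker can be illustrated on a binary lip mask: repeatedly move a square window to the centroid of the mask pixels it contains until it stops moving. The boundary-region likelihood and Level Set terms of the thesis are not reproduced here; `mean_shift_track` and the fixed window half-size are illustrative assumptions.

```python
import numpy as np

def mean_shift_track(mask, start, win=10, iters=20):
    """Toy Mean-Shift step on a binary lip mask: move a (2*win+1)-sized
    square window to the centroid of the foreground pixels it covers,
    iterating until convergence. The thesis replaces the fixed window
    with one resized via a boundary-region likelihood."""
    cy, cx = start
    h, w = mask.shape
    for _ in range(iters):
        y0, y1 = max(0, cy - win), min(h, cy + win + 1)
        x0, x1 = max(0, cx - win), min(w, cx + win + 1)
        ys, xs = np.nonzero(mask[y0:y1, x0:x1])
        if len(ys) == 0:          # window lost the target
            break
        ny = int(round(float(ys.mean()))) + y0
        nx = int(round(float(xs.mean()))) + x0
        if (ny, nx) == (cy, cx):  # converged
            break
        cy, cx = ny, nx
    return (cy, cx)
```

A real tracker would replace the binary mask with a color-likelihood (e.g. kernel-weighted histogram back-projection) image of the lips.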
     An approach fusing a muscle model with MPEG-4 is proposed. Skin points, skeleton points, and the control regions of muscles are defined on the Candide-3 face model. Skeleton points constrain the movement of the lip feature points; non-feature points inside a muscle's control region are adjusted by the muscle model, while those outside it are adjusted by lip animation definition tables. Loop subdivision and a simplified muscle model make the animation smoother and more efficient. Experimental results demonstrate that the control method produces noticeably more realistic lip movements in animation.
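A Waters-style linear muscle of the kind fused with MPEG-4 here can be sketched as a displacement field: vertices within a muscle's control radius are pulled toward its attachment point with a distance falloff, while vertices outside are left to the animation definition tables. The cosine falloff and the `muscle_displace` signature are assumptions of this sketch, not the thesis's exact model.

```python
import numpy as np

def muscle_displace(verts, attach, insert, radius, contraction):
    """Simplified linear-muscle deformation: vertices within `radius`
    of the muscle insertion are pulled toward the attachment point,
    scaled by a cosine falloff (1 at the insertion, 0 at the rim).
    Vertices outside the control region are returned unchanged."""
    out = verts.copy()                       # verts: float array (N, dim)
    direction = attach - insert
    direction = direction / np.linalg.norm(direction)
    d = np.linalg.norm(verts - insert, axis=1)
    inside = d < radius
    falloff = np.cos(d[inside] / radius * np.pi / 2)
    out[inside] += contraction * falloff[:, None] * direction
    return out
```

In the fused scheme, this adjustment would apply only to non-feature points inside the control region, with feature points driven by the skeleton points and the remaining points by the MPEG-4 lip animation definition tables.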
     An initial/final segmentation method is proposed, which establishes a loss function and exploits the quasi-periodicity of voiced sounds and the durations of initials. The method first computes the autocorrelation function of the speech; it then establishes the loss function and detects voiced segments from the results by dynamic programming; next, it determines the search scope for initials from their duration distribution; finally, within that scope, it segments initials and finals on both sides of the onset of each voiced region using an auditory-event detection method. Experimental results show that segmenting on top of the voiced regions reduces the impact of noise and Mandarin sound changes, improves segmentation accuracy, and thereby raises the accuracy of speech-driven animation.
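The voiced-detection front end can be approximated by a frame-wise normalized autocorrelation peak within a plausible pitch-lag range; the thesis's loss function and dynamic-programming smoothing across frames are replaced here by a simple threshold, and all parameter values (`frame`, the lag bounds, `thresh`) are illustrative assumptions.

```python
import numpy as np

def voiced_frames(signal, sr, frame=0.03, lag_lo=0.0025, lag_hi=0.02, thresh=0.5):
    """Mark each non-overlapping frame as voiced if its normalized
    autocorrelation peaks above `thresh` for some lag in the pitch
    range (here 50-400 Hz, i.e. lags of 2.5-20 ms)."""
    n = int(frame * sr)
    lo, hi = int(lag_lo * sr), min(int(lag_hi * sr), n - 1)
    flags = []
    for s in range(0, len(signal) - n + 1, n):
        x = signal[s:s + n] - signal[s:s + n].mean()
        e = float(np.dot(x, x))              # frame energy r(0)
        if e < 1e-12:
            flags.append(False)
            continue
        ac = max(float(np.dot(x[:-k], x[k:])) / e for k in range(lo, hi))
        flags.append(ac > thresh)
    return flags
```

The full method would feed such per-frame scores into a cost function minimized by dynamic programming, so that isolated mis-votes do not break a voiced run.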
     A dynamic Chinese viseme model is proposed. Because Chinese is a syllabic language whose pronunciation follows a spindle-shaped ("date-pit") intensity pattern, the model handles intra-syllable and inter-syllable lip motion separately. Within a syllable, a lip sub-movement model based on initials and finals is used: lip feature parameters of initials and finals are extracted and mouth shapes are clustered by these parameters to obtain a simplified viseme set; the mouth-shape likelihood between the sub-movements and the syllable's pronunciation process is then computed; finally, a parametric lip-movement model is constructed so that a small number of parameters can control the mouth shapes of Mandarin pronunciation. Between syllables, a weighting function that grades vowel influence simulates the effect of coarticulation. Experimental results show that, compared with Chinese visemes described by phoneme or triphone models, the method improves animation efficiency and balances the realism of the lip animation against its speed.
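The inter-syllable weighting idea can be sketched as a blend whose shape is bent by a vowel-influence weight: a heavily graded preceding vowel makes its mouth shape linger before the transition to the next syllable's target. `blend_visemes`, the scalar mouth-shape parameter, and the power-law warp are stand-in assumptions for the thesis's graded control function.

```python
def blend_visemes(targets, weights, steps=10):
    """Toy coarticulation blend between consecutive syllable mouth-shape
    targets. weights[k] >= 0 grades the vowel influence of syllable k:
    larger weights bend the interpolation curve so the preceding mouth
    shape persists longer into the transition."""
    frames = []
    for k, w in enumerate(weights):
        a, b = targets[k], targets[k + 1]
        for s in range(steps):
            t = s / steps
            alpha = t ** (1.0 + w)   # w = 0 gives plain linear blending
            frames.append(a + (b - a) * alpha)
    frames.append(targets[-1])
    return frames
```

In the full model each target would be a vector of lip parameters produced by the intra-syllable sub-movement model, not a single scalar.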
