An MPEG-4-Compatible Facial Animation System with Speech Synthesis and Its Application in Network Communication
Abstract
MPEG-4 is an object-based multimedia compression standard that allows the audio and video objects in a scene (natural or synthetic) to be encoded independently. MPEG-4 defines a special visual object, the "face object": through facial definition parameters (FDP) and facial animation parameters (FAP), a face model can be customized and animated. MPEG-4 can integrate facial animation with multimedia communication and control a virtual face over low-bandwidth networks.
     TTS (Text to Speech) is an attractive synthetic speech coding technique introduced in MPEG-4, and its combination with facial animation promises a wide range of applications. MPEG-4 also defines an application program interface for the TTS synthesizer: through this interface, the synthesizer can supply phonemes and the associated timing information to the face model, and the phonemes can be converted into corresponding mouth shapes, so that facial animation and synthetic speech are tightly coupled.
     Building on the existing work of our laboratory, and after a careful survey of the state of facial animation research, I chose "an MPEG-4-compatible facial animation system with speech synthesis and its application in network communication" as my research direction. Integrating facial animation with TTS synthetic speech within the scope of the MPEG-4 standard is not only new research work but also promises applications such as virtual newscasting and narrowband network communication. On this basis, I have developed two prototype systems with application potential, "Grimace VTTS" and "Grimace Chat".
     This thesis discusses the research direction above from the following aspects:
     1. Standard: an introduction to the MPEG-4 standard and the "face object" it defines;
     2. Enabling technologies: study of, and practice with, OpenGL for realistic rendering and the TTS engine of Microsoft Speech SDK 5.0;
     3. System architecture: the frameworks of the proposed facial animation system with speech (Grimace VTTS) and of the visual communication system for narrowband networks (Grimace Chat);
     4. Algorithms: simulation of facial muscle movement, optimized texture mapping from photographs of real faces, construction of the viseme and facial expression libraries, interpolation of transition frames, motion blending and coarticulation, superposition of expressions onto speech animation, and synchronization of animated mouth shapes with synthetic speech;
     5. Implementation and applications: the development techniques, functions, usage, and application scenarios of the prototype systems Grimace VTTS and Grimace Chat;
     6. Performance evaluation: subjective evaluation results of the facial animation system, together with the first objective evaluation of the system, including animation frame rate and function-level profiling;
     7. Platform requirements and future work: the software and hardware requirements of the current prototypes, an outlook for the Grimace system, and suggestions for further development.
MPEG-4 is an object-based multimedia compression standard that allows the audio-visual objects in a scene (natural or synthetic) to be encoded independently. The face object is a special visual object defined in MPEG-4; the facial definition parameters (FDP) and facial animation parameters (FAP) are the parameter sets used to calibrate and animate it. MPEG-4 enables the integration of facial animation with multimedia communication and allows facial animation over low-bit-rate communication channels.
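As a rough illustration of how FAP values can drive a face model, the sketch below displaces mesh vertices by FAP values scaled in FAPU (Facial Animation Parameter Units). The FAP id, the vertex weights, and the tiny two-vertex mesh are hypothetical assumptions for illustration, not data from this system or the standard's tables.

```python
# Minimal sketch of FAP-driven mesh deformation on a hypothetical
# two-vertex "mouth" mesh; real FDP/FAP tables are far larger.

# FAPU normalize FAP values to the proportions of a specific face
# model; this mouth-nose-separation unit is an assumed value.
MNS = 0.05

# Each FAP moves a set of vertices along one axis, scaled by a weight:
# fap_id -> list of (vertex_index, axis, weight). Illustrative only.
FAP_TABLE = {
    4: [(0, 1, 1.0), (1, 1, 0.6)],  # e.g. a "lower midlip" FAP
}

def apply_faps(vertices, fap_values):
    """Return a copy of vertices displaced by the given FAP values (in FAPU)."""
    out = [list(v) for v in vertices]
    for fap_id, value in fap_values.items():
        for idx, axis, weight in FAP_TABLE.get(fap_id, []):
            out[idx][axis] += value * MNS * weight
    return out

neutral = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0)]
deformed = apply_faps(neutral, {4: -2.0})  # move the lip downward
```

The neutral mesh is left untouched, so the same model can be re-deformed each frame from the FAP stream.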
    TTS (Text to Speech) is one of the promising synthetic audio tools provided by MPEG-4, and its integration with facial animation will lead to many applications. MPEG-4 defines an application program interface for the TTS synthesizer; using this interface, the synthesizer can provide phonemes and related timing information to the face model. The phonemes are then converted into corresponding mouth shapes, enabling simple talking-head applications.
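The phoneme-to-mouth-shape step described above can be sketched as a lookup over the timed phonemes a TTS engine is assumed to emit. The phoneme names, viseme labels, and (phoneme, start, duration) triples below are illustrative assumptions, not the actual MPEG-4 TTS interface or the Speech SDK's event format.

```python
# Sketch: convert timed phonemes from an assumed TTS callback into a
# timed viseme track that the face model can play back in sync.

# Toy phoneme-to-viseme map; real systems group dozens of phonemes
# into a much smaller set of visually distinct mouth shapes.
PHONEME_TO_VISEME = {
    "p": "closed_lips", "b": "closed_lips", "m": "closed_lips",
    "a": "open_wide", "o": "rounded", "sil": "neutral",
}

def viseme_track(phonemes):
    """Map (phoneme, start_ms, duration_ms) triples to timed visemes."""
    track = []
    for phoneme, start_ms, dur_ms in phonemes:
        viseme = PHONEME_TO_VISEME.get(phoneme, "neutral")
        track.append((viseme, start_ms, dur_ms))
    return track

track = viseme_track([("m", 0, 80), ("a", 80, 120), ("sil", 200, 50)])
```

Because every viseme keeps its phoneme's start time and duration, the animation loop can look up the active mouth shape from the audio clock, which is the basic idea behind lip-audio synchronization.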
    Building on the previous work of our lab, I surveyed the current state of facial animation research and chose "an MPEG-4-compatible facial animation system with TTS support and its application in network communication" as my research direction. Integrating facial animation with synthetic speech is not only a new field for our research but will also play an important role in applications such as virtual newscasting and visual communication over low-bandwidth networks. I have therefore developed two promising prototype systems, "Grimace VTTS" and "Grimace Chat".
    This paper will focus on the following aspects:
    1. Standard: an overview of the MPEG-4 standard and the basic technology behind its face object is presented.
    2. Technology support: OpenGL and the TTS engine of Microsoft Speech SDK 5.0 are introduced in detail, together with practice and examples.
    3. Framework of the Grimace system: the frameworks of Grimace VTTS (a prototype aimed at virtual newscasting) and Grimace Chat (a prototype aimed at visual communication) are proposed, and each module is described.
    4. Algorithms in the Grimace system: the specific algorithms adopted and optimized for the Grimace system are presented and discussed.
    5. Implementation and applications: the tools used to develop the Grimace system are introduced, and its functions and usage are described in detail.
    6. Evaluation of the Grimace system: both subjective and objective evaluations are presented.
    7. Platform requirements and future work: the run-time platform requirements of the Grimace system are introduced, followed by future directions and suggestions for this prototype system.
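One of the algorithm families mentioned in item 4, the interpolation of transition frames, can be sketched as linear blending between two viseme key poses. The two-parameter pose vectors and the frame count below are illustrative assumptions, not the system's actual pose representation.

```python
# Sketch of transition-frame interpolation: generate intermediate
# poses between two viseme key poses by linear blending.

def transition_frames(pose_a, pose_b, n):
    """Generate n intermediate poses strictly between two key poses."""
    frames = []
    for i in range(1, n + 1):
        t = i / (n + 1)  # blend factor, endpoints excluded
        frames.append([a + (b - a) * t for a, b in zip(pose_a, pose_b)])
    return frames

# Hypothetical poses: [lip-opening, lip-rounding] parameters.
closed = [0.0, 0.0]
rounded = [0.2, 1.0]
frames = transition_frames(closed, rounded, 3)
```

Coarticulation schemes refine this idea by letting neighboring visemes weight the blend, rather than interpolating each pair in isolation.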
