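Visual Speech Synthesis Technology and Its Application Studies in English Pronunciation Tutoring

Original title (Chinese): 视觉语音合成技术在英语发音辅导中的应用探究
Author: Xu Qin (许芹)
Degree level: Master's
Discipline: Communication and Information Systems
Keywords: English Pronunciation; Viseme; TTVS (Text-To-Visual Speech)
Degree year: 2007
Advisor: Zhang Jiping (张际平)
Discipline code: 081001
Degree-granting institution: East China Normal University (华东师范大学)
Thesis submission date: 2007-03-01
Chair of the defense committee: Zhang Qinzhu (张琴珠)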

Abstract

With the rapid advance of globalization, exchanges between China and the rest of the world have become increasingly frequent, and English, as the common international working language, is receiving more and more attention. However, years of emphasis on written instruction and the lack of a good environment for learning spoken English have left spoken-English teaching in China with little to show. Although computer technology is developing vigorously in China, computer-assisted language learning (CALL) there is still in its initial stage.
    In view of this situation, the author applies visual speech technology to pronunciation teaching for English beginners. Drawing on the Phonics teaching method widely promoted in the United States, this thesis develops an English pronunciation tutoring system in which lip movement is synchronized with speech. The system is intended to help beginners learn pronunciation in two ways:
    First, because speech is bimodal, visual speech lets users observe and imitate the facial movements of articulation more easily, which helps them understand and memorize speech sounds.
    Second, a user interface built on visual speech technology is friendlier and makes human-computer interaction more harmonious and natural, which helps ease beginners' anxiety and raises their motivation to learn.
    The main work of this thesis is as follows:
    A. Standards: the "face object" defined by MPEG-4 is introduced, and the Facial Animation Parameters (FAP) of that definition serve as the basis for the subsequent work.
    B. Technology: the TTS engine in Microsoft Speech SDK 5.1, which this thesis adopts, is studied and put into practice.
    C. System architecture: the framework of the proposed text-to-visual speech (TTVS) synthesis system is introduced and analyzed.
    D. Algorithms: the steps and algorithms used to implement the visual speech animation synthesis system are described in detail.
    E. Application: the knowledge structure, module functions, and usage scenarios of the "EP Tutor" system are described in detail.
    F. Future work: directions for the further development of the EP Tutor system are outlined.
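The abstract does not spell out the algorithms, so the following is only a minimal sketch of one step that a text-to-visual speech pipeline of this kind generally needs: turning a timed phoneme sequence into a viseme timeline that a face animation can follow. The phoneme symbols, the viseme class names, the durations, and the function names are all illustrative assumptions; they are not taken from the thesis, from the MPEG-4 viseme set, or from the Microsoft Speech SDK.

```python
from dataclasses import dataclass

# Hypothetical phoneme -> viseme-class table, for illustration only.
# It is NOT the thesis's table and does not follow MPEG-4 or SAPI numbering.
PHONEME_TO_VISEME = {
    "p": "bilabial", "b": "bilabial", "m": "bilabial",
    "f": "labiodental", "v": "labiodental",
    "aa": "open", "ae": "open",
    "iy": "spread", "ih": "spread",
    "uw": "rounded", "ow": "rounded",
    "sil": "neutral",
}


@dataclass
class VisemeKey:
    viseme: str
    start_ms: int
    end_ms: int


def phonemes_to_keys(timed_phonemes):
    """Convert (phoneme, duration_ms) pairs into viseme keys.

    Adjacent phonemes that share a viseme class are merged into one key,
    and unknown phonemes fall back to the neutral mouth shape.
    """
    keys = []
    t = 0
    for phoneme, duration_ms in timed_phonemes:
        viseme = PHONEME_TO_VISEME.get(phoneme, "neutral")
        if keys and keys[-1].viseme == viseme:
            keys[-1].end_ms = t + duration_ms
        else:
            keys.append(VisemeKey(viseme, t, t + duration_ms))
        t += duration_ms
    return keys


def viseme_at(keys, time_ms):
    """Return the viseme class shown at time_ms (neutral outside all keys)."""
    for key in keys:
        if key.start_ms <= time_ms < key.end_ms:
            return key.viseme
    return "neutral"


# Usage: the word "map" with assumed phoneme durations in milliseconds.
keys = phonemes_to_keys([("m", 80), ("ae", 120), ("p", 80)])
for key in keys:
    print(key.viseme, key.start_ms, key.end_ms)
```

In a real system the durations would come from the speech synthesizer rather than being hard-coded, and each viseme class would drive a set of animation parameters instead of a label; the table lookup is kept deliberately simple here so the timing logic is easy to follow.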

