Research on Key Technologies of Realistic Chinese Visual Speech Synthesis (真实感汉语可视语音合成关键技术研究)
Abstract
Visual speech synthesis, also known as speech animation, is the synthesis of a facial image sequence that matches given text or speech, deepening the viewer's comprehension of the spoken content. The technology has important applications in human-computer interaction, film and entertainment, information countermeasures, and other domains.
     This dissertation presents a design scheme for a large-scale Chinese bimodal corpus and a lip-extraction method for noisy color images. On this basis, several realistic Chinese visual speech synthesis methods are proposed, and a demonstration system centered on these techniques is designed and implemented. Experimental results show that the proposed synthesis methods can accomplish goals such as information spoofing in real time, accurately and effectively. The main contributions are as follows:
     For noisy color face images, a lip-extraction method based on peak-trend-detection segmentation is proposed. The segmentation method combines a parallel-line projection segmentation algorithm with a histogram-based weighted fuzzy c-means clustering algorithm. The core idea of the parallel-line projection algorithm is to map the two-dimensional histogram onto a one-dimensional histogram according to a mapping rule, thereby combining the accuracy of 2-D histogram segmentation with the real-time performance of 1-D histogram segmentation. Experiments show that the method achieves high extraction accuracy, provides precise lip coordinates for realistic Chinese visual speech synthesis, and is also used to select lip material for the corpus.
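     The dissertation's exact formulation is not reproduced here, but the following minimal sketch illustrates the histogram-based weighted fuzzy c-means step: the 256 bins of a (projected) 1-D intensity histogram are clustered, with each bin weighted by its pixel count, and the midpoint between the two resulting centers serves as the lip/skin threshold. All parameter values and the stand-in input are assumptions.

```python
import numpy as np

def weighted_fcm_histogram(hist, n_clusters=2, m=2.0, iters=100, tol=1e-6):
    """Weighted fuzzy c-means over histogram bins (bin weights = pixel counts)."""
    levels = np.arange(hist.size, dtype=float)           # bin values 0..255
    w = hist.astype(float)                               # per-bin weights
    centers = np.linspace(levels.min(), levels.max(), n_clusters)
    for _ in range(iters):
        d = np.abs(levels[None, :] - centers[:, None]) + 1e-9   # (clusters, bins)
        inv = d ** (-2.0 / (m - 1.0))
        u = inv / inv.sum(axis=0, keepdims=True)         # fuzzy memberships
        um = (u ** m) * w[None, :]                       # count-weighted u^m
        new_centers = (um * levels).sum(axis=1) / um.sum(axis=1)
        if np.max(np.abs(new_centers - centers)) < tol:
            centers = new_centers
            break
        centers = new_centers
    return np.sort(centers)

# Example: threshold a lip-likelihood channel (random stand-in for real data).
channel = np.random.randint(0, 256, (120, 160))
hist = np.bincount(channel.ravel(), minlength=256)
c_low, c_high = weighted_fcm_histogram(hist)
threshold = (c_low + c_high) / 2.0                       # lip/skin cut point
```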
     A design scheme for Bi-VSSDatabase, a large-scale Chinese bimodal corpus, is proposed. Principles for selecting the raw corpus and naming rules for its constituent files are formulated; a clustering method for mouth-shape feature parameters based on artificial-immune hybrid clustering is proposed; a tri-viseme model that captures Chinese coarticulation is built, and a bimodal corpus refinement algorithm is derived from it; annotation and segmentation methods for the bimodal corpus are designed. Statistical indicators such as coverage rate and coverage efficiency are computed, and the results show that Bi-VSSDatabase provides authentic, accurate, and broadly representative bimodal material for realistic Chinese visual speech synthesis.
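     As an illustration of the corpus statistics mentioned above, the sketch below uses two plausible definitions (assumptions, since the dissertation's exact formulas are not given here): coverage rate as the fraction of the tri-viseme inventory appearing in the selected corpus, and coverage efficiency as distinct tri-viseme types covered per selected sentence.

```python
from collections import Counter

def coverage_stats(sentences, inventory):
    """sentences: list of tri-viseme sequences; inventory: all tri-viseme types."""
    counts = Counter(unit for sent in sentences for unit in sent)
    covered = set(counts) & set(inventory)
    coverage_rate = len(covered) / len(inventory)          # fraction of types seen
    coverage_efficiency = len(covered) / len(sentences)    # types per sentence
    return coverage_rate, coverage_efficiency

# Example with toy tri-viseme labels:
rate, eff = coverage_stats([["a-b-c", "b-c-d"], ["a-b-c", "c-d-e"]],
                           ["a-b-c", "b-c-d", "c-d-e", "d-e-f"])
print(rate, eff)   # 0.75 1.5
```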
     Three speech-driven visual speech synthesis methods are proposed: an HMM state synthesis method, a mixed-parameter synthesis method, and a two-layer HMM synthesis method. Two text-driven methods are also proposed, based on HMMs and on unit concatenation respectively; for the latter, a concatenation-unit search procedure is designed and concatenation rules are defined. Chinese visual triphones (tri-visemes) and Chinese dynamic visemes are used, respectively, as the basic units for training and synthesis. For the tri-viseme-based synthesized sequences, both subjective satisfaction and objective evaluation scores reach "good" or better, showing that the proposed methods can synthesize smooth, continuous, and satisfying mouth-shape sequences. To stitch the mouth sequence into the background video, a lip-region inpainting method based on the fast marching algorithm is proposed, yielding complete, natural, and fluent talking-head video.
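     The dissertation's own repair method is not reproduced here; the sketch below only illustrates the same family of technique using OpenCV's built-in fast-marching inpainting (Telea's algorithm). The file names and the seam mask are illustrative assumptions.

```python
import cv2
import numpy as np

frame = cv2.imread("frame_with_pasted_mouth.png")   # assumed input frame
mask = np.zeros(frame.shape[:2], dtype=np.uint8)
mask[215:225, 170:310] = 255                        # assumed seam band around the pasted lip region

# Fast-marching inpainting propagates surrounding pixels into the masked seam,
# blending the synthesized mouth into the background video frame.
repaired = cv2.inpaint(frame, mask, 3, cv2.INPAINT_TELEA)
cv2.imwrite("repaired_frame.png", repaired)
```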
     An objective quality assessment method for visual speech based on an improved product HMM is proposed; it simulates the viewer's audio-visual perception of a talking-head video and produces an objective score. Using this method, the quality of the synthesis approaches proposed in this dissertation is compared, and the comparison shows that visual speech synthesis can greatly improve comprehension of speech content, especially for hearing-impaired listeners.
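     The improved product HMM itself is not reproduced here. As a minimal sketch of the underlying idea, audio-visual consistency can be scored by fusing per-stream HMM log-likelihoods with stream weights, a common simplification of product-HMM scoring; the models, features, and weights below are assumptions.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM   # third-party: pip install hmmlearn

def av_consistency(audio_feats, video_feats, hmm_audio, hmm_video, w_a=0.6, w_v=0.4):
    """Weighted fusion of per-stream log-likelihoods; higher = more consistent."""
    ll_a = hmm_audio.score(audio_feats)   # log P(audio features | audio HMM)
    ll_v = hmm_video.score(video_feats)   # log P(mouth features | video HMM)
    return w_a * ll_a + w_v * ll_v

# Example with toy models trained on random features:
hmm_a = GaussianHMM(n_components=3).fit(np.random.randn(200, 13))   # e.g. MFCCs
hmm_v = GaussianHMM(n_components=3).fit(np.random.randn(200, 6))    # e.g. lip params
score = av_consistency(np.random.randn(50, 13), np.random.randn(50, 6), hmm_a, hmm_v)
```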