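A Research of Prosody Modeling and Synthesis Method in Chinese TTS

Original title (Chinese): 汉语TTS中的韵律建模与合成方法研究
Author: He Peigang (贺培刚)
Degree level: Master's
Discipline: Circuits and Systems
Keywords: Speech Synthesis ; Artificial Neural Network (ANN) ; Prosody Modeling ; Spectral Modification
Degree year: 2008
Advisor: Jiang Baochen (蒋保臣)
Discipline code: 080902
Degree-granting institution: Shandong University
Submission date: 2008-03-09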

Abstract

Over the past few decades, advances in computer technology and related disciplines have driven rapid progress in speech synthesis, and many new theories and techniques have emerged. Current research centers on Text-To-Speech (TTS) systems, which convert input text into speech output. A TTS system generally consists of four modules: text analysis, prosody control, speech synthesis, and a unit database. These modules are not independent of one another; the performance of each strongly affects the quality of the final output speech.
     Synthesized speech is evaluated on many aspects, chiefly clarity, intelligibility, and naturalness. Existing TTS systems already reach a fairly high level of clarity and intelligibility, but overall naturalness still needs improvement. This thesis studies the prosody control and speech synthesis modules, aiming to improve the naturalness of synthesized speech by improving these two modules.
     The prosody control module has a large influence on the naturalness of synthesized speech. Of the many research topics it covers, this thesis focuses on prosody modeling. A prosody model converts qualitative, high-level prosodic information into quantitative acoustic parameters for use by the subsequent speech synthesis module. Using artificial neural networks, we design and implement a model that predicts the fundamental frequency (F0) contour, duration, and pause of Chinese syllables. Experiments show that the model captures, to a reasonable degree, how these three parameters vary across syllables in Chinese declarative sentences.
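As a rough illustration of the kind of mapping a neural prosody model performs, the sketch below trains a very small feed-forward network that turns symbolic syllable features into numeric prosody targets. Everything in it is an assumption made for illustration: the feature set (tone, position in the sentence, break level), the six outputs (four normalized F0 samples, duration, pause), the network size, and the two toy training pairs are invented here and are not the thesis's actual design or data.

```python
import math
import random

def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]

def encode_syllable(tone, position, boundary):
    """Hypothetical input features: tone 0-4 (four lexical tones plus neutral),
    position 0..1 within the sentence, boundary 0-3 (break level after the syllable)."""
    return one_hot(tone, 5) + [position] + one_hot(boundary, 4)

class TinyMLP:
    """One tanh hidden layer and a linear output layer, trained by plain SGD."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_out)]
        self.b2 = [0.0] * n_out

    def forward(self, x):
        h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(self.w1, self.b1)]
        y = [sum(w * hi for w, hi in zip(row, h)) + b
             for row, b in zip(self.w2, self.b2)]
        return h, y

    def train_step(self, x, target, lr=0.05):
        """One gradient step on the squared error; returns the loss before the update."""
        h, y = self.forward(x)
        err = [yi - ti for yi, ti in zip(y, target)]
        # Back-propagate into the hidden layer before the output weights change.
        dh = [sum(err[k] * self.w2[k][j] for k in range(len(err))) * (1.0 - h[j] ** 2)
              for j in range(len(h))]
        for k, e in enumerate(err):
            for j, hj in enumerate(h):
                self.w2[k][j] -= lr * e * hj
            self.b2[k] -= lr * e
        for j, d in enumerate(dh):
            for i, xi in enumerate(x):
                self.w1[j][i] -= lr * d * xi
            self.b1[j] -= lr * d
        return 0.5 * sum(e * e for e in err)

# Toy targets (invented): [F0 at 4 points, duration, pause], all scaled to 0..1.
samples = [
    (encode_syllable(tone=0, position=0.0, boundary=0), [0.8, 0.8, 0.8, 0.8, 0.5, 0.0]),
    (encode_syllable(tone=3, position=1.0, boundary=3), [0.9, 0.7, 0.4, 0.2, 0.7, 1.0]),
]

net = TinyMLP(n_in=10, n_hidden=8, n_out=6)
losses = [sum(net.train_step(x, t) for x, t in samples) for _ in range(300)]
first_loss, last_loss = losses[0], losses[-1]
_, pred = net.forward(samples[1][0])
print("loss: %.4f -> %.4f" % (first_loss, last_loss))
```

In a real system the predicted values would be de-normalized and passed to the synthesis module as targets; a practical model would also need far richer context features and a labeled speech corpus.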
     The speech synthesis module produces the final output speech and today generally relies on waveform concatenation. In addition to selecting the optimal sequence of synthesis units, it must modify some of the speech waveforms so that the synthesized speech sounds more fluent and natural. This thesis studies an optimal unit selection algorithm together with a Fourier-transform-based spectral smoothing algorithm. The smoothing algorithm smooths the speech spectrum well and, to some extent, avoids the marked loss of speech quality that traditional algorithms cause.
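Selecting an optimal unit sequence is commonly posed as a shortest-path search: each candidate unit has a target cost (how well it matches the requested prosody) and each pair of adjacent units has a concatenation cost (how smoothly they join). The sketch below shows that general formulation with a Viterbi-style dynamic program. It is not necessarily the algorithm used in the thesis, and the unit tuples, cost functions, and F0 values are hypothetical.

```python
def select_units(candidates, target_cost, concat_cost):
    """candidates[i] lists the candidate units for position i.
    Returns (total_cost, path) for the cheapest unit sequence."""
    # trellis[i][j] = (cheapest cost of a path ending at candidates[i][j], backpointer)
    trellis = [[(target_cost(0, u), None) for u in candidates[0]]]
    for i in range(1, len(candidates)):
        column = []
        for u in candidates[i]:
            # Cheapest way to reach u from any unit at the previous position.
            cost, k = min(
                (trellis[i - 1][k][0] + concat_cost(p, u), k)
                for k, p in enumerate(candidates[i - 1])
            )
            column.append((cost + target_cost(i, u), k))
        trellis.append(column)
    # Backtrack from the cheapest final unit.
    j = min(range(len(trellis[-1])), key=lambda k: trellis[-1][k][0])
    total = trellis[-1][j][0]
    path = []
    for i in range(len(candidates) - 1, -1, -1):
        path.append(candidates[i][j])
        j = trellis[i][j][1]
    path.reverse()
    return total, path

# Hypothetical units: (label, F0 at start in Hz, F0 at end in Hz).
candidates = [
    [("ni_a", 210, 190), ("ni_b", 213, 195)],
    [("hao_a", 220, 140), ("hao_b", 196, 170)],
]
target_f0 = [200, 180]  # invented target mean F0 per syllable

def target_cost(i, unit):
    # Distance between the unit's mean F0 and the requested mean F0.
    return abs((unit[1] + unit[2]) / 2 - target_f0[i])

def concat_cost(prev_unit, next_unit):
    # F0 jump at the join between two adjacent units.
    return abs(prev_unit[2] - next_unit[1])

total, path = select_units(candidates, target_cost, concat_cost)
print(total, [u[0] for u in path])
```

In this toy case, choosing each unit by target cost alone would pick `ni_a` and `hao_a`, which join with a 30 Hz jump. The joint search instead accepts slightly worse target matches in exchange for a much smoother join.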
     To verify the algorithms together, a simple TTS system that uses all of the algorithms above is built in this thesis. Listening tests indicate that its output speech is fairly natural and, to some extent, more natural than that of the previous system.
