Research on Prosodic Structure Prediction Based on Statistical Models

Original title: 基于统计模型的韵律结构预测研究
Author: 包森成
Degree level: Master's
Discipline: Signal and Information Processing
Keywords: Prosodic structure prediction; Conditional Random Fields; Maximum Entropy Model; Feature template
Degree year: 2009
Advisor: 董远
Discipline code: 081002
Degree-granting institution: Beijing University of Posts and Telecommunications (北京邮电大学)
Thesis submission date: 2009-02-15

Abstract

With advances in computer technology and related disciplines, speech synthesis has developed rapidly over the past few decades, and a large number of new theories and techniques have emerged. At present, research on speech synthesis centers on text-to-speech (TTS) systems, which convert input text into speech output. A TTS system generally consists of three modules: text analysis, prosody processing, and speech synthesis. These modules are not independent of one another; the performance of each has a large effect on the quality of the final output speech.

Synthesized speech can be evaluated in many respects, but evaluation concentrates on two: intelligibility and naturalness. The intelligibility of current TTS output has reached a fairly high level, while its overall naturalness still needs improvement. The underlying problem is that the prosody of natural speech cannot yet be modeled effectively. Research on prosody processing covers four main areas: prosody prediction, prosody rules, prosody description, and prosody modeling. This thesis studies the prosodic structure prediction module, with the aim of improving the naturalness of synthesized speech by improving that module.

Prosody prediction is closely tied to text analysis. Because the input to a TTS system is unrestricted text, determining pronunciation alone is far from sufficient. To improve naturalness, more prosody-related information must be extracted from the text, including its prosodic structure, stress, and intonation. Studies have shown that introducing a prosodic hierarchy into a TTS system significantly improves the quality of the synthesized speech, and its naturalness in particular. How to raise the accuracy of prosodic structure prediction is the focus of this thesis.

Starting from the acoustic and prosodic characteristics of Chinese, the thesis analyzes the relationships among prosodic features, pauses, stress, and prosodic boundaries in Chinese, analyzes and compares Chinese prosodic hierarchies, and analyzes the acoustic characteristics of prosodic boundaries. It reviews and compares traditional methods of prosodic structure prediction and points out their advantages and disadvantages. It then concentrates on prosodic structure prediction based on statistical machine learning, in particular the application of Conditional Random Fields (CRFs) and the Maximum Entropy (ME) model.

For the CRF-based prosodic structure prediction system, the thesis sets out the theory in detail: the definition of CRFs, their conditional distribution, and parameter estimation. On the application side, it concentrates on the CRF feature templates and discusses the choice of window length and the effect of combined features.

For the ME-based prosodic structure prediction system, the thesis likewise sets out the definition of the ME model, its conditional distribution, and parameter estimation. On the application side, it concentrates on the ME feature templates and discusses the choice of window length and the effect of dynamic features. In addition, the thesis proposes a multi-pass prosodic structure prediction system based on the ME model and compares its performance with that of the CRF-based system. In prosodic phrase prediction, the multi-pass ME system outperforms the CRF-based system.
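
The feature templates and window length discussed in the abstract can be illustrated with a small sketch. It is not the thesis's actual template: the function name `window_features`, the feature names, the choice of word, part-of-speech tag, and word length as atomic features, and the two part-of-speech conjunctions used as combined features are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence

PAD = "<PAD>"  # hypothetical placeholder for positions outside the sentence


def _at(seq: Sequence[str], j: int) -> str:
    """Return seq[j], or PAD when j falls outside the sentence."""
    return seq[j] if 0 <= j < len(seq) else PAD


def window_features(words: List[str], pos_tags: List[str], i: int,
                    window: int = 2) -> Dict[str, str]:
    """Build features for the candidate prosodic boundary after word i.

    Atomic features are taken at every offset in [-window, window];
    combined features join the POS tags adjacent to the boundary.
    """
    feats: Dict[str, str] = {}
    for off in range(-window, window + 1):
        w = _at(words, i + off)
        feats[f"W[{off}]"] = w
        feats[f"POS[{off}]"] = _at(pos_tags, i + off)
        feats[f"LEN[{off}]"] = PAD if w == PAD else str(len(w))
    # Combined (conjunction) features, independent of the window length.
    cur = _at(pos_tags, i)
    feats["POS[0]|POS[1]"] = f"{cur}|{_at(pos_tags, i + 1)}"
    feats["POS[-1]|POS[0]"] = f"{_at(pos_tags, i - 1)}|{cur}"
    return feats


if __name__ == "__main__":
    words = ["我们", "研究", "韵律", "结构"]
    pos_tags = ["r", "v", "n", "n"]  # illustrative tags, not a fixed tag set
    for name, value in sorted(window_features(words, pos_tags, 1, window=1).items()):
        print(name, value)
```

Widening `window` adds atomic features from more distant words, which is the trade-off behind the window-length experiments the abstract mentions. Each feature dictionary, paired with a boundary label, is the kind of input a CRF or ME trainer would typically consume.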

