Waveform-Concatenation Speech Synthesis Based on Binary Semantic Annotation
Abstract
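Text-to-speech (TTS) technology converts text, whether generated by the computer itself or supplied from outside (for example, the contents of a text file or a Word document), into a speech signal according to speech-processing rules. In other words, it lets the computer read text aloud fluently so that people can grasp the content simply by listening. With the rapid development of computer and communication technology, TTS has been applied in spoken dialogue systems, voice call centers, voice-enabled websites, e-mail services, and many other areas, where it has already proven highly effective. However, existing TTS systems still fall far short of what users expect in naturalness and intelligibility. No TTS system can yet truly replace a human reader, and this limits wider use of TTS.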
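     The first difficulty in speech synthesis lies in mapping text to prosodic annotation. In natural language, speech features vary endlessly, and the data themselves carry implicit knowledge. Humans can perceive this knowledge, but our understanding and description of it remain far from adequate. In the automatic conversion from text to prosodic symbols, limited natural language understanding has long been the bottleneck. At present, text-to-prosody conversion can usually rely only on basic grammatical information (such as part of speech) to delimit intonation phrases or assign ordinary sentence stress; it cannot yet perform deeper processing based on sentence semantics (such as assigning different expressive or emotional coloring). Second, at the acoustic level, the acoustic parameters corresponding to prosodic features are not fully understood and lack a complete description, so they can only be set by experience. This further hinders rendering the prosodic information annotated in the text as natural synthesized speech with a sense of prosody and stress.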
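     This paper builds on our laboratory's earlier work in natural language understanding, namely binary semantic relation analysis. We establish a set of descriptive symbols for text-to-speech synthesis that conforms to the XML (Extensible Markup Language) standard, together with rules for converting semantic annotation into prosodic annotation for speech synthesis, so that semantic descriptions are converted automatically into descriptions of speech prosody. We also address the pronunciation of polyphonic characters, digits, symbols, and letters in the text, and define a series of pronunciation descriptions for these cases.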
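     As a rough illustration of the rule-based conversion described above, the sketch below rewrites semantic relation labels into prosodic attributes using Python's standard `xml.etree.ElementTree`. The element names (`s`, `w`, `speak`), the `rel` labels, and the `stress` attribute are all hypothetical; the source does not give the actual symbol set or conversion rules.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule table: semantic relation label -> prosodic attributes.
# Neither the labels nor the attribute names come from the source.
RULES = {
    "focus": {"stress": "strong"},
    "modifier": {"stress": "weak"},
}


def semantic_to_prosodic(xml_text):
    """Copy each <w rel="..."> word into a <speak> tree, replacing the
    semantic label with the prosodic attributes the rule table assigns."""
    source = ET.fromstring(xml_text)
    speak = ET.Element("speak")
    for word in source.iter("w"):
        # Words with an unknown or missing label get no prosodic attributes.
        attrs = dict(RULES.get(word.get("rel"), {}))
        node = ET.SubElement(speak, "w", attrs)
        node.text = word.text
    return ET.tostring(speak, encoding="unicode")


if __name__ == "__main__":
    tagged = '<s><w rel="modifier">今天</w><w rel="focus">下雨</w><w>了</w></s>'
    print(semantic_to_prosodic(tagged))
```

     A table-driven design keeps the rules separate from the traversal, so new semantic labels could be mapped without touching the conversion code.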
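     For prosodic speech synthesis, we collected 1248 single Chinese characters and more than 8000 high-frequency two-, three-, and four-character words, along with common personal and place names. After organizing and numbering this material, we recorded it manually using a speech-database maintenance program developed specifically for this system. The recordings were segmented and analyzed for pitch periods, then stored in a speech database and a retrieval index database, forming the basic speech data the system requires.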
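     The source does not say how input text is matched against this inventory of one- to four-character units. One common approach, assumed here purely for illustration, is forward maximum matching: at each position, take the longest unit that exists in the inventory and fall back to a single character otherwise. The function name `segment` and the toy inventory are hypothetical.

```python
def segment(text, inventory, max_len=4):
    """Split text left to right into the longest units found in the inventory.

    Single characters are always accepted as a fallback, mirroring a corpus
    that stores isolated characters alongside multi-character words.
    """
    units = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink down to one character.
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if n == 1 or piece in inventory:
                units.append(piece)
                i += n
                break
    return units


if __name__ == "__main__":
    # Toy inventory; the real corpus holds more than 8000 multi-character units.
    inventory = {"语音合成", "计算机", "技术"}
    print(segment("计算机语音合成技术", inventory))
```

     Preferring longer recorded units reduces the number of splice points, which is the usual motivation for storing whole words in a concatenative system.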
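     The speech synthesis module contains a speech-rate modification unit, a tone-of-voice modification unit, a stress modification unit, a silence generation unit, and others. These are implemented as modules that expose interfaces for the speech synthesis module to call in order to change the…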
Text-to-speech (TTS) technology converts text, whether generated by the computer itself or entered by a user (for example, a text file or a Word document), into speech. In short, the goal is to let the computer read text fluently so that people can understand the information just by listening. With the rapid development of computer and communication technology, TTS has been applied to spoken dialogue systems, call centers, voice web pages, voice e-mail systems, and other areas, with significant effect. However, current TTS systems all fall short in naturalness and intelligibility, and no TTS system can truly read text in place of a person, so TTS is still used only in limited fields.
     The first difficulty is the tagging of prosodic information. In natural language, speech features vary widely and carry a great deal of implicit knowledge. People can perceive this knowledge but cannot adequately describe it. In automatically converting text into prosodic markup, limited understanding of natural language is the bottleneck of research. At present, the conversion can rely only on basic syntactic information (such as part of speech) to delimit intonation phrases or set sentence stress; it cannot yet process the text more deeply according to its semantics. Second, on the acoustic side, the acoustic parameters of prosodic features are not fully understood; they lack a complete description and are set only by experience. These limitations hinder rendering the annotated prosodic information as natural speech.
     In this paper, we build on our laboratory's earlier work in natural language processing, binary semantic relation analysis, and define a set of XML-conformant marks for annotating text that is to be converted into speech. We also define a set of rules for converting the semantic description into a prosodic description. In addition, we consider polyphonic characters, numbers, symbols, and letters, and define a series of pronunciation descriptions for these cases.
     For prosodic speech synthesis, we collected 1248 single Chinese characters and more than 8000 frequently used Chinese words, including two-, three-, and four-character words and common names of people and places. After organizing and numbering them, we recorded all of them by hand using our speech-database maintenance program; after segmentation and pitch marking, we saved them into a speech database and an index database, which together provide the basic speech data of our TTS system.
     The speech synthesis module contains a speech-rate editing unit, a speech-mode editing unit, a stress editing unit, a silence generation unit, and others. All units are implemented as modules that expose interfaces.
     In this speech synthesis system, we first define prosodic marks grounded in a deeper understanding of natural language and convert semantic markup into prosodic markup based on binary semantic relation analysis; this kind of markup is therefore more advanced and comes closer to the prosodic intent of a human speaker. In the synthesis procedure, based on the PSOLA algorithm and an extensive speech database, we implement simple prosodic control of the voice, which makes the synthesized speech clear and natural and considerably improves intelligibility and naturalness.
     Future work includes: deeper research into semantic markup and prosody, so that more semantic information can be converted into prosodic information; building a more extensive speech database, so that the corpus contains not only sentences but also paragraphs of text; and creating more prosodic control units, to control prosody not only within sentences but also between sentences and paragraphs.
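     To make the role of a silence generation unit concrete, the sketch below produces a run of zero-valued samples and inserts it between recorded units when joining them. This is plain concatenation, not PSOLA, and it is not the system's actual interface: the function names, the 16 kHz sample rate, and the representation of audio as lists of integer samples are all assumptions made for illustration.

```python
def make_silence(duration_ms, sample_rate=16000):
    """Return duration_ms of silence as a list of zero-valued samples."""
    return [0] * (sample_rate * duration_ms // 1000)


def concatenate(units, pause_ms=0, sample_rate=16000):
    """Join recorded units in order, inserting a pause between neighbours.

    Each unit is a list of integer samples. No pause is added before the
    first unit or after the last one.
    """
    output = []
    for index, unit in enumerate(units):
        if index > 0:
            output.extend(make_silence(pause_ms, sample_rate))
        output.extend(unit)
    return output


if __name__ == "__main__":
    first = [100, -100, 50]   # stand-ins for two recorded units
    second = [20, -20]
    joined = concatenate([first, second], pause_ms=200)
    print(len(joined))  # 3 + 3200 + 2 samples at 16 kHz
```

     Exposing the pause length as a parameter is one way a caller could lengthen breaks at phrase boundaries marked in the prosodic annotation.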
