A Study of a News-Style Prosodic Model for Lhasa Tibetan
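Abstract
With the development of waveform concatenation techniques based on large-scale corpora, research on speech synthesis systems has made major progress, and the intelligibility of synthesized speech is now sufficient for practical applications. The naturalness of synthesized speech, however, remains unsatisfactory, mainly because the prosodic models used in synthesis systems are still imperfect. To eliminate the difference between synthesized speech and natural human speech flow, and so synthesize speech with higher naturalness, a high-quality prosodic model must be built. The currently popular approach is data-driven: the model is trained on a large amount of corpus data so that it can output high-quality prosodic control parameters and improve the naturalness of the synthesized speech.
     Human spoken speech flow is constrained by physiological mechanisms, chiefly respiratory regulation, and respiration is an important cue for delimiting prosodic levels. Studying the relationship between respiration and the prosodic hierarchy, identifying the respiratory signal parameters that affect prosodic features, and using them together with speech signal parameters as training data is a new way of handling prosodic modeling and a fresh attempt at building a high-quality prosodic model.
     To meet the practical development needs of Tibetan speech synthesis, this thesis uses news text as the training corpus, studies the speech and prosodic characteristics of Lhasa Tibetan, identifies the respiratory signal parameters that affect prosodic features, and uses an RBF neural network to build a news-style prosodic model for Lhasa Tibetan that predicts prosodic control parameters. The main work is as follows: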
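     1. The speech characteristics of Lhasa Tibetan were studied. Drawing on research results on Chinese prosodic structure, the prosodic levels of Tibetan were determined, the prosodic structure of Tibetan and its characteristics were analyzed, and the parameters that reflect prosodic features were identified as the input parameter set of the prosodic model.
     2. One year of the Tibet Daily (《西藏日报》) was collected. Text design and optimization were carried out according to the characteristics of Lhasa Tibetan, so that the selected material largely covers the segmental and suprasegmental features of Lhasa Tibetan. After normalization and recording, prosodic annotation rules suited to Tibetan were designed, and a prosodic model corpus for Lhasa Tibetan was built.
     3. Based on the physiological mechanism of human phonation, the way the respiratory signal changes during phonation was studied. Data analysis established the correspondence between respiratory signals and prosodic features, and the relevant parameters were collected as training parameters for the model.
     4. Based on the results of the earlier prosodic structure analysis, 39-dimensional context information parameters in 6 groups that reflect prosodic features were determined, and a prosodic model was built with an RBF neural network whose output is a set of 10-dimensional prosodic control parameters. The model was trained and tested on the annotated material in the corpus, and analysis of the results indicates that the model has good predictive performance.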
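Item 4 describes a mapping from 39 context parameters to 10 prosodic control parameters through an RBF neural network. As a rough illustration only, the sketch below implements a generic Gaussian RBF network in plain Python: the hidden units are Gaussian functions centered on a random subset of the training inputs, and the linear output weights are obtained by ridge-regularized least squares. The class name `RBFNetwork`, the center-selection rule, the width `sigma`, the `ridge` term, and the synthetic data are all assumptions made for this example; the abstract does not specify the network's configuration or training procedure.

```python
import math
import random


def gaussian_features(x, centers, sigma):
    """Hidden-layer activations: one Gaussian per center."""
    return [math.exp(-sum((a - b) ** 2 for a, b in zip(x, c)) / (2.0 * sigma ** 2))
            for c in centers]


def solve_linear(A, B):
    """Solve A X = B by Gauss-Jordan elimination with partial pivoting.
    A is n x n, B is n x m; returns X as an n x m list of lists."""
    n = len(A)
    M = [list(A[i]) + list(B[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0.0:
                f = M[r][col]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[col])]
    return [row[n:] for row in M]


class RBFNetwork:
    """Hypothetical minimal RBF network (not the thesis implementation)."""

    def __init__(self, n_centers=20, sigma=1.0, ridge=1e-6, seed=0):
        self.n_centers, self.sigma, self.ridge = n_centers, sigma, ridge
        self.rng = random.Random(seed)
        self.centers, self.weights = [], []

    def fit(self, X, Y):
        # Assumption: centers are a random subset of the training inputs.
        self.centers = self.rng.sample(X, min(self.n_centers, len(X)))
        H = [gaussian_features(x, self.centers, self.sigma) for x in X]
        k, m = len(H[0]), len(Y[0])
        # Normal equations: (H^T H + ridge * I) W = H^T Y
        HtH = [[sum(h[i] * h[j] for h in H) + (self.ridge if i == j else 0.0)
                for j in range(k)] for i in range(k)]
        HtY = [[sum(h[i] * y[j] for h, y in zip(H, Y)) for j in range(m)]
               for i in range(k)]
        self.weights = solve_linear(HtH, HtY)
        return self

    def predict(self, x):
        h = gaussian_features(x, self.centers, self.sigma)
        m = len(self.weights[0])
        return [sum(h[i] * self.weights[i][j] for i in range(len(h)))
                for j in range(m)]


if __name__ == "__main__":
    # Synthetic stand-in data: 39 context features in, 10 control parameters out.
    rng = random.Random(42)
    X = [[rng.random() for _ in range(39)] for _ in range(40)]
    Y = [[rng.random() for _ in range(10)] for _ in range(40)]
    model = RBFNetwork(n_centers=20, sigma=1.0).fit(X, Y)
    print(len(model.predict(X[0])))  # one value per prosodic control parameter
```

Using fewer centers than training samples, as in the demo, makes the network smooth rather than interpolate the data; how many centers and what width suit real prosodic data would have to be chosen on held-out material.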
With the rapid development of waveform concatenation technology based on large-scale corpora, research on speech synthesis systems has made significant progress, and the intelligibility of synthesized speech now meets the needs of practical applications. However, the naturalness of synthesized speech is still not ideal, mainly because the prosodic models in synthesis systems remain deficient. To eliminate the difference between synthesized speech and natural human speech and achieve higher naturalness, a high-quality prosodic model must be established. At present, the popular approach is data-driven: large corpora are used to train the model so that it outputs high-quality prosodic control parameters and improves the naturalness of synthesized speech.
     The realization of spoken language flow is constrained by human physiological mechanisms, chiefly respiratory regulation, and respiration is an important cue for dividing prosodic layers. Studying the relationship between respiration and prosodic layers, identifying the respiratory signal parameters that affect prosodic features, and using them together with speech signal parameters as training data is a new approach to prosodic modeling and a new attempt at establishing a high-quality prosodic model.
     To meet the practical development needs of Tibetan speech synthesis, this paper takes news text as the training corpus, analyzes the speech and prosodic features of the Tibetan Lhasa dialect, identifies the respiratory signal parameters that affect prosodic features, and adopts an RBF neural network to establish a news-style prosodic model for the Lhasa dialect that predicts prosodic control parameters. The main work is as follows:
     1. The speech features of the Tibetan Lhasa dialect were studied. Building on previous results on Chinese prosodic structure, the Tibetan prosodic layers were determined, the prosodic structure of Tibetan and its features were analyzed, and the parameters that reflect prosodic features were selected as the input parameter set of the prosodic model.
     2. A full year of the Tibet Daily was collected. The texts were designed and optimized according to the features of the Lhasa dialect so that the selected corpus largely covers its segmental and suprasegmental features. After text normalization and speech recording, prosodic labeling rules suited to Tibetan were designed, and a prosodic model corpus for the Tibetan Lhasa dialect was established.
     3. In light of the physiological mechanism of human phonation, the changes in the respiratory signal during phonation were studied. Data analysis established the correspondence between respiratory signals and prosodic features, and the related parameters were collected as training parameters for the model.
     4. Based on the earlier prosodic structure analysis, 6 groups of context feature parameters with 39 dimensions in total were determined. An RBF neural network was used to establish the prosodic model, which outputs 10-dimensional prosodic control parameters. The labeled material in the corpus was used for model training and testing, and the results indicate that the model predicts well.
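Item 2 says the texts were optimized so that the corpus covers the segmental and suprasegmental features of the dialect, but the abstract does not say how. One common way to do this kind of selection is greedy coverage: repeatedly pick the sentence that adds the most units not yet covered. The sketch below shows that idea under stated assumptions; the function name `greedy_select`, the sentence ids, and the unit labels are hypothetical placeholders, not the thesis's actual procedure or a real Lhasa Tibetan unit inventory.

```python
def greedy_select(sentences, max_sentences=None):
    """Greedy coverage selection.

    sentences: dict mapping a sentence id to the set of units it contains
               (for example syllables or tone labels; placeholders here).
    Returns the chosen sentence ids in selection order.
    """
    uncovered = set().union(*sentences.values())
    remaining = dict(sentences)
    chosen = []
    while uncovered and remaining:
        if max_sentences is not None and len(chosen) >= max_sentences:
            break
        # Pick the sentence that adds the most still-uncovered units;
        # iterating over sorted ids makes ties go to the smallest id.
        best = max(sorted(remaining), key=lambda s: len(remaining[s] & uncovered))
        gain = remaining.pop(best) & uncovered
        if not gain:
            break
        chosen.append(best)
        uncovered -= gain
    return chosen


# Hypothetical candidate sentences with placeholder unit labels.
candidates = {
    "s1": {"ka", "pa", "ta"},
    "s2": {"ka", "ma"},
    "s3": {"pa", "ta"},
    "s4": {"ma", "nga"},
}
print(greedy_select(candidates))  # ['s1', 's4']
```

Here `s1` is chosen first because it covers three units, and `s4` follows because it covers both remaining ones. A real corpus design would likely also weight units by frequency or rarity, which this sketch leaves out.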
