Two-stage intonation modeling using feedforward neural networks for syllable based text-to-speech synthesis


Authors: V. Ramu Reddy (ramu.csc@gmail.com); K. Sreenivasa Rao (ksrao@iitkgp.ac.in)
Keywords: Intonation models; Prediction accuracy; Text-to-speech synthesis; Feedforward neural networks; Linguistic constraints; Production constraints; Positional; Contextual; Phonological; Articulatory; F0 of syllable; Tilt
Journal: Computer Speech & Language
Year: 2013
Publication date: August 2013
Volume: 27
Issue: 5
Pages: 1105-1126
Full-text size: 1383 K

Abstract

This paper proposes a two-stage feedforward neural network (FFNN) based approach for modeling fundamental frequency (F₀) values of a sequence of syllables. In this study, (i) linguistic constraints represented by positional, contextual and phonological features, (ii) production constraints represented by articulatory features and (iii) linguistic relevance tilt parameters are proposed for predicting intonation patterns. In the first stage, tilt parameters are predicted using linguistic and production constraints. In the second stage, F₀ values of the syllables are predicted using the tilt parameters predicted from the first stage, and basic linguistic and production constraints. The prediction performance of the neural network models is evaluated using objective measures such as average prediction error (μ), standard deviation (σ) and linear correlation coefficient (γ_{X,Y}). The prediction accuracy of the proposed two-stage FFNN model is compared with other statistical models such as Classification and Regression Tree (CART) and Linear Regression (LR) models. The prediction accuracy of the intonation models is also analyzed by conducting listening tests to evaluate the quality of synthesized speech obtained after incorporation of intonation models into the baseline system. From the evaluation, it is observed that prediction accuracy is better for two-stage FFNN models, compared to the other models.
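
The three objective measures named in the abstract can be illustrated with a minimal sketch. The abstract does not give their formulas, so the definitions below are assumptions: the average prediction error is taken as the mean absolute difference between actual and predicted F0 values, the standard deviation as the spread of those absolute errors around their mean, and the correlation coefficient as the Pearson coefficient. The function name `prediction_metrics` and the sample values are hypothetical.

```python
import math


def prediction_metrics(actual, predicted):
    """Return (mu, sigma, gamma) for two equal-length lists of F0 values in Hz.

    Assumed definitions (not stated in the abstract):
      mu    = mean absolute prediction error
      sigma = population standard deviation of the absolute errors
      gamma = Pearson linear correlation between actual and predicted values
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(actual)

    # Average prediction error.
    abs_errors = [abs(a - p) for a, p in zip(actual, predicted)]
    mu = sum(abs_errors) / n

    # Standard deviation of the errors around their mean.
    sigma = math.sqrt(sum((e - mu) ** 2 for e in abs_errors) / n)

    # Pearson correlation: covariance divided by the product of the
    # standard deviations. Undefined if either sequence is constant.
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted)) / n
    std_a = math.sqrt(sum((a - mean_a) ** 2 for a in actual) / n)
    std_p = math.sqrt(sum((p - mean_p) ** 2 for p in predicted) / n)
    if std_a == 0 or std_p == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    gamma = cov / (std_a * std_p)

    return mu, sigma, gamma


# Hypothetical syllable-level F0 values (Hz), for illustration only.
actual_f0 = [100.0, 120.0, 140.0]
predicted_f0 = [110.0, 120.0, 130.0]
mu, sigma, gamma = prediction_metrics(actual_f0, predicted_f0)
```

Under these assumed definitions, a lower μ and σ together with a γ closer to 1 would indicate predictions that track the actual F0 contour more closely.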
