Research on Automatic Annotation Methods for Speech Synthesis Corpora
Abstract
In recent years, speech synthesis technology has advanced rapidly in both research and practical application, and synthetic speech has improved markedly in quality and naturalness. The mainstream approaches today are statistical parametric synthesis based on hidden Markov models (HMMs) and unit-selection waveform concatenation based on large corpora. Building a synthesis system with either approach requires constructing a speech corpus first. The speech resources for corpus construction can be obtained in several ways: one can design text material specifically for synthesis and record a corpus, or one can reuse existing speech data such as video and audiobook material. Whichever route is taken, the corpus must be annotated.
     Annotation of a synthesis corpus comprises segmental annotation and prosodic annotation. Segmental annotation means producing the phoneme sequence and segmenting it; phonetic segmentation marks the start and end time of each phoneme, and the segmentation information is typically used only to initialize model training. Existing automatic segmental annotation techniques are already adequate for system construction. Prosodic annotation labels the prosodic information of the speech. The prosodic categories to be labeled depend on the language; for a Mandarin synthesis system, prosodic annotation mainly means labeling the prosodic hierarchy. Because prosodic labels serve as context information for the models, their accuracy directly affects the quality of the synthetic speech. Prosodic annotation of a synthesis corpus normally calls for trained annotators. As corpora grow, however, the manual workload rises sharply and several annotators must share the task, making annotation very costly; moreover, prosodic judgments are partly subjective, so keeping different annotators consistent is difficult. Accurate automatic annotation of synthesis corpora has therefore become an important research direction.
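     For concreteness, here is a minimal sketch of what these two annotation layers might hold for one utterance. The class and field names are hypothetical illustrations, not the dissertation's actual label format:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class PhoneSegment:
    phone: str    # phoneme label, e.g. "zh" or "ang1"
    start: float  # start time in seconds
    end: float    # end time in seconds

@dataclass
class AnnotatedUtterance:
    text: str
    # Segmental annotation: phoneme sequence with time alignment.
    phones: List[PhoneSegment] = field(default_factory=list)
    # Prosodic annotation: break level at the juncture after each
    # syllable (e.g. 0 = none, 1 = prosodic word, 2 = prosodic phrase,
    # 3 = intonational phrase), following the Mandarin prosodic hierarchy.
    break_levels: List[int] = field(default_factory=list)
```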
     This dissertation studies the automatic annotation of speech synthesis corpora and proposes methods for labeling prosodic information under different application scenarios and corpus styles. The main contributions are the following:
     An automatic prosodic annotation method based on HMM acoustic modeling and state decoding is proposed. Its advantages for corpus annotation are threefold: when labeling prosody from acoustic feature distributions it can fully account for the influence of other known annotations on the distribution parameters; by decoding whole sentences it captures the dependencies among prosodic labels at different positions; and by adopting a framework similar to speech recognition it can reuse the mature model training and decoding algorithms of that field. In the implementation, we first propose an exhaustive-search method for automatic prosodic phrase boundary annotation, analyze how the various features and context information of the synthesis system affect labeling performance, and verify the feasibility of the approach; we then propose a Viterbi-search-based method that speeds up labeling while preserving its accuracy.
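     To make the decoding step concrete, the sketch below runs a Viterbi search over binary boundary/no-boundary decisions at candidate junctures. The inputs are hypothetical stand-ins: local_scores would hold per-juncture log-likelihoods from the context-dependent HMMs, and trans_logprob would hold label-transition scores. This illustrates the search, not the dissertation's implementation:

```python
import numpy as np

def viterbi_boundaries(local_scores, trans_logprob):
    """Find the best boundary label sequence for one sentence.

    local_scores[j, b]: log-likelihood of the acoustics around juncture j
        given boundary decision b (0 = no prosodic phrase boundary,
        1 = boundary).
    trans_logprob[b_prev, b]: log-probability of decision b following
        b_prev, modeling the dependency between neighboring labels.
    Exhaustive search would score all 2**J label sequences; this dynamic
    program reaches the same optimum in time linear in J.
    """
    J = local_scores.shape[0]
    delta = np.full((J, 2), -np.inf)    # best score ending in state b
    back = np.zeros((J, 2), dtype=int)  # backpointers
    delta[0] = local_scores[0]
    for j in range(1, J):
        for b in (0, 1):
            cand = delta[j - 1] + trans_logprob[:, b]
            back[j, b] = int(np.argmax(cand))
            delta[j, b] = cand[back[j, b]] + local_scores[j, b]
    # Backtrace the optimal decision sequence.
    labels = [int(np.argmax(delta[-1]))]
    for j in range(J - 1, 0, -1):
        labels.append(back[j, labels[-1]])
    return labels[::-1]
```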
     An acoustic modeling method based on the DNN-HMM (deep neural network plus hidden Markov model) hybrid is designed and implemented for automatic prosodic annotation. It exploits the stronger acoustic modeling capability of DNNs, relative to Gaussian mixture models (GMMs), to further raise the accuracy of automatic prosodic annotation.
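     In hybrid DNN-HMM systems the network's state posteriors are commonly turned into emission scores by dividing out the state priors ("scaled likelihoods"). A minimal sketch of that standard conversion, with generic array shapes assumed; whether the dissertation follows exactly this recipe is not stated here:

```python
import numpy as np

def scaled_log_likelihoods(log_posteriors, state_priors, eps=1e-8):
    """Convert DNN outputs into HMM emission scores.

    log_posteriors: (T, S) frame-wise log P(state | observation)
        from the network's softmax layer.
    state_priors: (S,) state occupancy frequencies counted from the
        aligned training data.
    Returns (T, S) log p(observation | state) up to a constant:
        log P(s|o) - log P(s), the usual hybrid scaled likelihood,
        which can then drive the same HMM decoding as a GMM system.
    """
    return log_posteriors - np.log(state_priors + eps)
```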
     An unsupervised automatic prosodic annotation method combining feature-clustering initialization with HMM acoustic modeling is proposed. It annotates the prosody of a synthesis corpus without any manually labeled prosodic data, so personalized synthesis systems covering multiple speakers and speaking styles can be built automatically. Experiments on prosodic phrase boundary annotation of a reading-style corpus and on emphasis position annotation of a storytelling-style corpus confirm the method's effectiveness.
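     A hedged sketch of the bootstrap loop, assuming hypothetical juncture features (pause duration in column 0, F0 reset, etc.) and placeholder train_hmms / relabel callables that stand in for the acoustic modeling and decoding described above:

```python
import numpy as np
from sklearn.cluster import KMeans

def unsupervised_boundary_labels(juncture_feats, train_hmms, relabel,
                                 max_iters=10):
    """Bootstrap prosodic labels without manual annotation.

    juncture_feats: (N, D) acoustic features at candidate junctures,
        e.g. pause duration, F0 reset, final-syllable lengthening.
    train_hmms(labels) -> models: trains context-dependent HMMs
        treating the current labels as ground truth.
    relabel(models) -> labels: re-decodes the corpus (e.g. with the
        Viterbi search sketched earlier) under the current models.
    """
    # Step 1: feature clustering provides the initial labels; assume
    # the cluster with the longer mean pause is the boundary class.
    km = KMeans(n_clusters=2, n_init=10, random_state=0)
    labels = km.fit_predict(juncture_feats)
    if km.cluster_centers_[0, 0] > km.cluster_centers_[1, 0]:
        labels = 1 - labels  # make cluster 1 the boundary class
    # Step 2: alternate model training and relabeling until stable.
    for _ in range(max_iters):
        models = train_hmms(labels)
        new_labels = relabel(models)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels
```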
     An unsupervised emphasis annotation and synthesis method based on hidden emphasis states is proposed. In the preceding work, the emphasis label entered decision-tree clustering as an ordinary context feature; when emphatic units are scarce, the emphasis information is rarely reflected in the clustering, accurate emphatic/neutral models are hard to train, and both annotation performance and the rendering of emphasis in synthetic speech suffer. We therefore separate the emphasis information from the other context features: a hidden emphasis state layer is introduced, and linear transforms represent the effect of emphasis on the acoustic feature distributions. This avoids the impact of emphasis sparseness on model accuracy; in addition, because the hidden emphasis state layer describes emphasis probabilistically, it remedies the inadequacy of the earlier binary emphasis labels as a description of real speech.
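     One possible formalization of this emission model, written in notation of our own choosing rather than the dissertation's exact parameter tying: the hidden emphasis state e selects an affine transform of the neutral Gaussian of HMM state s, and marginalizing over e, or taking its posterior given the observations O, yields the probabilistic emphasis description:

```latex
p(\mathbf{o}_t \mid s) \;=\; \sum_{e \in \{\mathrm{neu},\,\mathrm{emp}\}}
    P(e)\,\mathcal{N}\!\left(\mathbf{o}_t;\;
    \mathbf{A}_e \boldsymbol{\mu}_s + \mathbf{b}_e,\;
    \boldsymbol{\Sigma}_s\right),
\qquad
P(\mathrm{emp} \mid \mathbf{O}) \;=\;
\frac{P(\mathrm{emp})\, p(\mathbf{O} \mid \mathrm{emp})}
     {\sum_{e'} P(e')\, p(\mathbf{O} \mid e')}
```

     The posterior on the right serves as the soft emphasis label, replacing the hard binary decision of the previous chapter.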
