Research on Speech Synthesis Techniques Based on Statistical Acoustic Modeling
Abstract
Over the past decade or so, as statistical modeling methods for speech signals have matured and the performance of parametric synthesizers has steadily improved, the idea of statistical parametric speech synthesis has been proposed and has attracted growing attention from researchers. Its representative form, parametric speech synthesis based on the hidden Markov model (HMM), has gradually developed into a mainstream speech synthesis approach standing alongside corpus-based unit selection and waveform concatenation. Compared with the conventional unit selection and waveform concatenation approach, HMM-based parametric synthesis offers smoother and more robust synthetic speech, fast and highly automated system construction, a small system footprint, and high flexibility.
     This dissertation focuses on the application of statistical acoustic models to speech synthesis and, beyond the existing HMM-based parametric synthesis method, proposes two new synthesis methods based on statistical acoustic modeling. The first is HMM-based unit selection and waveform concatenation synthesis: we combine the acoustic modeling ideas of HMM-based parametric synthesis with the conventional unit selection and waveform concatenation approach, using probabilistic criteria to guide the search for the optimal unit sequence and generating the final speech by concatenating waveforms, so as to overcome the limited segmental quality of parametric synthesis and improve the naturalness of the synthesized speech. The second is joint modeling and synthesis of acoustic and articulatory features: in addition to acoustic parameters, we introduce articulatory features, which are more directly related to the mechanism of speech production, and modify the original HMM structure so that the two kinds of features can be modeled and generated jointly, thereby improving the accuracy and flexibility of acoustic parameter prediction at synthesis time.
     The dissertation is organized as follows:
     Chapter 1 is the introduction. It reviews the history of speech synthesis research and briefly describes several common speech synthesis methods.
     Chapter 2 presents the HMM-based parametric speech synthesis method in detail, covering the fundamentals of HMMs, the system framework, and the key techniques involved. Through an analysis of the characteristics of this method, it also explains the motivation for, and the starting point of, our research on new synthesis methods.
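     To make the core technique just mentioned concrete, consider a minimal sketch in notation of our own choosing (the formulation in Chapter 2 may differ in detail). In HMM-based parametric synthesis, the static parameter sequence c is typically generated by maximizing the output probability of the full observation sequence o = Wc, where W is the known matrix that appends dynamic (delta and delta-delta) features to the static ones:

\[
\hat{c} = \arg\max_{c} P(Wc \mid \lambda, q), \qquad W^{\top}\Sigma_q^{-1}W\,\hat{c} = W^{\top}\Sigma_q^{-1}\mu_q
\]

where \(\mu_q\) and \(\Sigma_q\) stack the means and covariances of the state sequence q. Because the closed-form solution couples neighboring frames through W, the generated trajectories are smooth, which accounts for both the fluency and the segmental-quality limitation of parametric synthesis noted earlier.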
     Chapter 3 focuses on the HMM-based unit selection and waveform concatenation synthesis algorithm. We first propose two implementations of HMM-based unit selection: one uses frame-sized concatenation units and searches for units under a maximum likelihood criterion; the other uses a two-level hierarchy of phone- and frame-sized units and selects units by combining the likelihood criterion with the Kullback-Leibler divergence (KLD). We then summarize a unified algorithmic framework for HMM-based unit selection synthesis and demonstrate its effectiveness through evaluations on Chinese and English synthesis systems. Finally, we propose the minimum unit selection error (MUSE) criterion to replace the maximum likelihood criterion used in conventional HMM training, which enables fully automatic system construction and further improves the naturalness of the synthesized speech.
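     As a hedged sketch of the two selection criteria just named (again in our own notation; Chapter 3 gives the precise definitions): the maximum likelihood criterion chooses, among candidate unit sequences u from the corpus, the one whose acoustic parameters O(u) are most probable under the trained HMMs, while the KLD measures the mismatch between the model \(p_{\mathrm{tgt}}\) of a target unit and the model \(p_{\mathrm{cand}}\) of a candidate unit:

\[
\hat{u} = \arg\max_{u} P\big(O(u) \mid \lambda\big), \qquad D_{\mathrm{KL}}(p_{\mathrm{tgt}} \,\|\, p_{\mathrm{cand}}) = \int p_{\mathrm{tgt}}(x) \log \frac{p_{\mathrm{tgt}}(x)}{p_{\mathrm{cand}}(x)}\, dx
\]

On one plausible reading of the hierarchical scheme, the phone-level KLD plays the role of a target cost while the frame-level likelihood scores how well consecutive candidate frames fit the models; the MUSE criterion then trains the models so that the units chosen by this search deviate as little as possible from the natural recordings.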
     Chapter 4 presents the joint statistical modeling and synthesis of articulatory and acoustic features. Here, "articulatory features" refers to quantitative descriptions of the positions and movements of the speaker's articulators, such as the tongue, lips, and jaw, during speech production. After explaining the motivation for introducing articulatory features and briefly reviewing the original system framework, we propose an overall scheme for jointly modeling and generating acoustic and articulatory features, and discuss several possible model structures along three dimensions: the model clustering strategy, the state synchrony assumption, and the cross-feature independence assumption. A series of objective and subjective evaluations then shows that this articulatory-enhanced system construction method effectively improves the accuracy and flexibility of acoustic parameter prediction.
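     As a hedged sketch of what the joint modeling in this chapter can look like (our notation; the concrete model structures are discussed in Chapter 4): each observation stacks the acoustic features \(x_t\) and the articulatory features \(y_t\), and every HMM state j emits the joint vector from a single Gaussian,

\[
o_t = \begin{bmatrix} x_t \\ y_t \end{bmatrix}, \qquad b_j(o_t) = \mathcal{N}\!\left(o_t;\, \begin{bmatrix} \mu_j^{x} \\ \mu_j^{y} \end{bmatrix}, \Sigma_j\right)
\]

The three design dimensions above then map onto concrete choices: whether the two streams share one decision tree or are clustered separately, whether they share state boundaries, and whether \(\Sigma_j\) is block-diagonal (streams conditionally independent given the state) or contains cross-stream terms, for instance by letting the acoustic mean depend linearly on \(y_t\). The last choice is what allows an articulatory modification, such as repositioning the tongue, to change the generated acoustics.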
     Chapter 5 concludes the dissertation.