A uniform phase representation for the harmonic model in speech synthesis applications
  • Authors: Gilles Degottex (1) (2), Daniel Erro (3) (4)

    1. Computer Science Department, University of Crete (UOC-CSD), Heraklion, 71003, Greece
    2. Institute of Computer Science, Foundation for Research and Technology-Hellas (FORTH-ICS), Heraklion, 71110, Greece
    3. Basque Science Foundation (IKERBASQUE), Bilbao, 48013, Spain
    4. Aholab, University of the Basque Country, Bilbao, 48013, Spain
  • Keywords: Speech synthesis; Harmonic model; Phase modeling; Voice transformation; Parametric speech synthesis
  • Journal: EURASIP Journal on Audio, Speech, and Music Processing
  • Publication date: December 2014
  • Volume: 2014
  • Issue: 1
  • Full-text size: 2,964 KB
  • Journal subject: Signal, Image and Speech Processing
  • Publisher: Springer International Publishing
  • ISSN: 1687-4722
Abstract
Feature-based vocoders, e.g., STRAIGHT, offer a way to manipulate the perceived characteristics of the speech signal in speech transformation and synthesis. For the harmonic model, which provides excellent perceived quality, features for the amplitude parameters already exist (e.g., Line Spectral Frequencies (LSF), Mel-Frequency Cepstral Coefficients (MFCC)). However, because of the wrapping of the phase parameters, phase features are more difficult to design. To randomize the phase of the harmonic model during synthesis, a voicing feature is commonly used to distinguish voiced from unvoiced segments. However, voice production allows smooth transitions between voiced and unvoiced states, which can make the voicing segmentation difficult to estimate. In this article, two phase features are suggested to represent the phase of the harmonic model in a uniform way, without any voicing decision. The synthesis quality of the resulting vocoder has been evaluated, using subjective listening tests, in the context of resynthesis, pitch scaling, and Hidden Markov Model (HMM)-based synthesis. The experiments show that the suggested signal model is comparable to STRAIGHT, or even better in some scenarios. They also reveal some limitations of the harmonic framework itself in the case of high fundamental frequencies.
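To make the harmonic model and the phase-wrapping problem mentioned in the abstract concrete, the following is a minimal generic sketch, not the paper's HMPD method: a sum-of-harmonics synthesizer, and a toy illustration of why raw phases confined to (-π, π] are awkward as features. The function name `harmonic_synthesis` and the example values are illustrative assumptions, not from the article.

```python
import numpy as np

def harmonic_synthesis(f0, amps, phases, fs, dur):
    """Generic harmonic model: s(t) = sum_k a_k * cos(2*pi*k*f0*t + phi_k).

    f0     : fundamental frequency in Hz
    amps   : amplitude a_k of each harmonic k = 1, 2, ...
    phases : phase phi_k of each harmonic, in radians
    fs     : sampling rate in Hz
    dur    : duration in seconds
    """
    t = np.arange(int(dur * fs)) / fs
    s = np.zeros_like(t)
    for k, (a, phi) in enumerate(zip(amps, phases), start=1):
        s += a * np.cos(2 * np.pi * k * f0 * t + phi)
    return s

# Two harmonics (100 Hz and 200 Hz), zero phases, 10 ms at 16 kHz.
s = harmonic_synthesis(100.0, [1.0, 0.5], [0.0, 0.0], 16000, 0.01)

# Phase wrapping: estimated phases live on (-pi, pi], so a smooth
# underlying phase track shows artificial jumps of ~2*pi. Unwrapping
# restores continuity without changing the represented angle.
raw = np.array([2.9, -3.0, 2.8, -2.9])  # wrapped samples of a smooth track
unwrapped = np.unwrap(raw)              # consecutive jumps now within pi
```

Statistical modeling (averaging, interpolation, HMM training) behaves badly on the `raw` values because the jump from 2.9 to -3.0 is a discontinuity of the representation, not of the signal; this is the difficulty that motivates dedicated phase features.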
