A Study of Formant Structure in Converting Chinese Whispered Speech to Normal Speech
Abstract
Whispered speech is a special mode of spoken communication. Research on converting whispered speech to normal speech has important scientific value, and it also has practical applications such as private communication in public places, voice restoration for aphonic patients, and forensic work in public security and justice.
    This thesis analyzes the similarities and differences between the voiced and unvoiced sounds of normal speech and whispered speech in terms of short-time energy, short-time average magnitude, short-time zero-crossing rate, short-time autocorrelation, and the short-time average magnitude difference function (AMDF), and summarizes the time-domain differences between whispered and normal speech.
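The short-time features compared above can be sketched in a few lines. This is a minimal illustration on a synthetic frame, not the thesis's analysis code; the frame length, sampling rate, and test signal are made-up values chosen only to show the behavior of each feature.

```python
import numpy as np

def short_time_features(frame):
    """Compute the short-time features compared in the text for one frame."""
    energy = np.sum(frame ** 2)                     # short-time energy
    avg_magnitude = np.mean(np.abs(frame))          # short-time average magnitude
    # short-time zero-crossing rate: fraction of adjacent samples changing sign
    zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2
    # short-time autocorrelation at lag k
    def autocorr(k):
        return np.sum(frame[:-k] * frame[k:]) if k > 0 else energy
    # short-time average magnitude difference function (AMDF) at lag k
    def amdf(k):
        return np.mean(np.abs(frame[:-k] - frame[k:]))
    return energy, avg_magnitude, zcr, autocorr, amdf

# Example: a voiced-like 100 Hz sinusoid sampled at 8 kHz (30 ms frame)
fs = 8000
t = np.arange(240) / fs
frame = np.sin(2 * np.pi * 100 * t)
energy, mag, zcr, autocorr, amdf = short_time_features(frame)
# A periodic (voiced-like) frame shows a low zero-crossing rate and an AMDF
# minimum at the pitch period (80 samples here); an unvoiced or whispered frame
# would show the opposite pattern, which is what the time-domain comparison uses.
```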
    Linear predictive coding (LPC) is one of the most effective methods for extracting speech formants, but accurate extraction is hampered by spurious peaks and merged peaks. By analyzing the pole-interaction phenomenon and how to correct for it, and building on the fact that in speech recognition a formant's spectral density matters more than its bandwidth, this thesis proposes an improved LPC algorithm based on pole interaction: the radii of the formant poles are modified to reduce the error caused by pole interaction, so that formants can be extracted accurately. Experiments show that this algorithm not only extracts formant parameters accurately but also resolves the spurious-peak and merged-peak problems, and it remains robust when extracting formants from noisy speech.
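A minimal sketch of the conventional LPC root-finding step that the improved algorithm builds on: fit LPC coefficients, take the roots of the prediction polynomial, and convert each pole's angle and radius into a candidate formant frequency and bandwidth. This is an assumption-laden illustration on a synthetic two-resonance signal, not the thesis's pole-radius-modification algorithm itself; the bandwidth threshold and signal parameters are invented for the example.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc(frame, order):
    """Autocorrelation-method LPC: returns A(z) coefficients [1, -a1, ..., -ap]."""
    r = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    a = solve_toeplitz((r[:order], r[:order]), r[1:order + 1])
    return np.concatenate(([1.0], -a))

def formants_from_lpc(a, fs, max_bw=400.0):
    """Poles above the real axis -> candidate (frequency, bandwidth) pairs.
    A pole radius closer to 1 means a sharper spectral peak; the radius-
    modification idea described in the text adjusts exactly this quantity."""
    cands = []
    for z in np.roots(a):
        if z.imag <= 0:
            continue
        freq = np.angle(z) * fs / (2 * np.pi)
        bw = -fs / np.pi * np.log(np.abs(z))    # 3 dB bandwidth from pole radius
        if 50 < freq < fs / 2 - 50 and bw < max_bw:
            cands.append((freq, bw))
    return sorted(cands)

# Synthetic "vowel": resonances at 700 Hz and 1200 Hz, fs = 8 kHz
fs = 8000
t = np.arange(640) / fs
rng = np.random.default_rng(0)
x = np.sin(2*np.pi*700*t) + 0.6*np.sin(2*np.pi*1200*t)
x = (x + 1e-3 * rng.standard_normal(len(t))) * np.hamming(len(t))
f = formants_from_lpc(lpc(x, 8), fs)   # candidate formants near 700 and 1200 Hz
```

The spurious-peak and merged-peak problems mentioned above show up in this raw candidate list; the thesis's contribution is the pole-radius correction applied before the frequencies are read off.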
    By analyzing formant tracking curves, the average formant frequencies of male and female normal speech and whispered speech are obtained, revealing the differences in formant structure between whispered and normal speech.
    Among mapping-rule-based models, the Gaussian mixture model (GMM) offers good performance and robustness. Building on the GMM, this thesis constructs a mapping model from the line spectral frequency (LSF) parameters of Chinese whispered speech to the LSF parameters of normal speech, realizing the conversion of Chinese whispered speech to normal speech.
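The GMM mapping can be sketched in its minimum-mean-square-error regression form: given a joint GMM over source and target parameters, the converted value is a responsibility-weighted sum of per-component linear regressions. The sketch below is scalar and uses two hand-set components purely for illustration; in the thesis's setting the variables would be LSF vectors with full covariances, and the components would be trained on paired whispered/normal data, none of which is reproduced here.

```python
import numpy as np

# Hand-set joint GMM components (weight, mean_x, mean_y, var_x, cov_xy, var_y).
# These numbers are invented for illustration, not learned from speech data.
components = [
    (0.5, -2.0, -3.0, 1.0, 0.9, 1.0),   # in this regime y ~ -3 + 0.9*(x + 2)
    (0.5,  2.0,  3.0, 1.0, 0.9, 1.0),   # in this regime y ~  3 + 0.9*(x - 2)
]

def convert(x):
    """MMSE mapping under the joint GMM:
       y_hat = sum_m p(m|x) * (mu_y_m + cov_xy_m / var_x_m * (x - mu_x_m))"""
    # responsibilities p(m|x) from each component's marginal density over x
    resp = np.array([w * np.exp(-0.5 * (x - mx)**2 / vx) / np.sqrt(vx)
                     for w, mx, my, vx, cxy, vy in components])
    resp /= resp.sum()
    # responsibility-weighted per-component linear regressions
    return sum(r * (my + cxy / vx * (x - mx))
               for r, (w, mx, my, vx, cxy, vy) in zip(resp, components))

y = convert(2.0)   # near a component mean, follows that component's regression
```

The soft weighting is what distinguishes the GMM mapping from a single global regression: between the two regimes the estimate blends both component lines rather than switching abruptly, which is one reason for the robustness claimed above.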
