Research on Speaker Recognition Based on EMD
Abstract
Among biometric authentication technologies, speaker recognition, owing to its convenience, economy, and accuracy, has gradually become a prominent means of identity verification in daily life and work, and has been widely applied in security-sensitive fields such as electronic commerce, helpdesks, forensics, and telephone banking; it is currently an active research topic. Speaker feature parameters are the foundation of a speaker recognition system. At present, most studies extract speaker features with short-time analysis methods, i.e. the Fourier transform. However, the speech signal is a typically nonlinear signal, and applying linear signal analysis inevitably loses important information. In view of this, the thesis carries out a series of studies; the main work and contributions are as follows:
     First, the thesis improves existing feature parameters. A perceptual weighting technique is adopted: interpolated signal-to-mask ratios (SMRs) computed from a psychoacoustic model serve as the weighting function, which is applied in mel-cepstrum analysis to obtain weighted mel-cepstrum coefficients (WMCEP). In the experiments, WMCEP features are combined with a GMM model for speaker recognition.
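The weighting step described here can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the function names (`mel_filterbank`, `wmcep`), the FFT length, filterbank size, and the `smr_weights` vector are chosen for the example. The thesis derives the weights from a psychoacoustic model; here any positive per-band vector stands in for them.

```python
import numpy as np

def mel_filterbank(n_filters, n_fft, sr):
    """Standard triangular mel filterbank over the rFFT bins."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = imel(np.linspace(mel(0.0), mel(sr / 2.0), n_filters + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

def wmcep(frame, sr, smr_weights, n_filters=20, n_ceps=12):
    """Weighted mel-cepstrum sketch: log mel-band energies are scaled
    by per-band perceptual weights (in the thesis, interpolated SMRs
    from a psychoacoustic model) before the cepstral DCT."""
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n=512)) ** 2
    fb = mel_filterbank(n_filters, 512, sr)
    log_e = np.log(fb @ spec + 1e-10)
    weighted = log_e * smr_weights        # the perceptual weighting step
    # unnormalized DCT-II gives the cepstral coefficients
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2.0 * n_filters)))
    return dct @ weighted
```

Because the weighting is applied before a linear transform, scaling the weight vector scales the resulting coefficients proportionally; the shape of the weight vector, not its scale, carries the perceptual information.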
     Secondly, the thesis introduces a nonlinear signal analysis method, the Hilbert-Huang Transform (HHT), which consists of Empirical Mode Decomposition (EMD) and Hilbert Spectral Analysis (HSA). Applying EMD together with short-time analysis to speech signals, three feature extraction algorithms are proposed. In the experiments, an SVM model, well suited to classification problems, is combined with the proposed features for speaker recognition: in the training stage it builds speaker models, and in the prediction stage it matches test features against the models built during training. To assess the SVM's classification performance, a GMM model is used for comparison.
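The EMD stage can be sketched as follows. This is a simplified illustration, not the thesis's algorithm: standard EMD interpolates the extrema envelopes with cubic splines and stops sifting by a standard-deviation criterion, whereas linear envelope interpolation and a fixed sifting count are assumed here to keep the sketch self-contained. The names `sift` and `emd` are chosen for the example.

```python
import numpy as np

def sift(x, n_sift=10):
    """One sifting pass (simplified): repeatedly subtract the mean of
    the upper and lower envelopes through the local extrema, so the
    result oscillates around zero like an IMF."""
    h = np.asarray(x, dtype=float).copy()
    t = np.arange(len(h))
    for _ in range(n_sift):
        d = np.diff(h)
        maxima = np.where((d[:-1] > 0) & (d[1:] <= 0))[0] + 1
        minima = np.where((d[:-1] < 0) & (d[1:] >= 0))[0] + 1
        if len(maxima) < 2 or len(minima) < 2:
            break                      # too few extrema: treat as residue
        upper = np.interp(t, maxima, h[maxima])   # linear envelopes
        lower = np.interp(t, minima, h[minima])
        h = h - (upper + lower) / 2.0
    return h

def emd(x, n_imfs=4):
    """Decompose x into IMFs and a residue so that
    x = sum(imfs) + residual holds exactly by construction."""
    residual = np.asarray(x, dtype=float)
    imfs = []
    for _ in range(n_imfs):
        imf = sift(residual)
        imfs.append(imf)
        residual = residual - imf
    return imfs, residual
```

The exact additive reconstruction (signal = IMFs + residue) is the property that lets each IMF, or the residue, be analyzed separately as a feature source.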
     Thirdly, from the perspective of theoretical analysis, the thesis studies the feasibility and effectiveness of extracting feature parameters by combining EMD with short-time analysis. The analysis rests on two foundations: EMD feature extraction based on the HSA spectrum and the marginal spectrum, and EMD feature extraction based on the residual phase.
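Both theoretical tools named here rest on the analytic signal. A minimal sketch under stated assumptions (the function names, the frequency bin count, and the use of a single IMF as input are all chosen for the example; the thesis's exact definitions may differ):

```python
import numpy as np

def analytic(x):
    """Analytic signal via the FFT-based Hilbert transform: zero the
    negative frequencies and double the positive ones."""
    n = len(x)
    X = np.fft.fft(x)
    h = np.zeros(n)
    h[0] = 1.0
    if n % 2 == 0:
        h[n // 2] = 1.0
        h[1:n // 2] = 2.0
    else:
        h[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(X * h)

def marginal_spectrum(imf, sr, n_bins=64):
    """Hilbert marginal spectrum of one IMF: instantaneous amplitude
    accumulated over time, binned by instantaneous frequency, giving
    total amplitude contributed at each frequency."""
    z = analytic(imf)
    amp = np.abs(z)
    phase = np.unwrap(np.angle(z))
    inst_f = np.diff(phase) * sr / (2 * np.pi)          # Hz
    inst_f = np.clip(inst_f, 0.0, sr / 2.0)
    bins = np.minimum((inst_f / (sr / 2.0) * n_bins).astype(int), n_bins - 1)
    spec = np.zeros(n_bins)
    np.add.at(spec, bins, amp[:-1])
    return spec

def residual_phase(residual):
    """Residual phase: cosine of the analytic phase of a residual
    signal; amplitude is discarded and only phase structure kept."""
    return np.cos(np.angle(analytic(residual)))
```

For a pure tone, the marginal spectrum concentrates in the bin containing the tone's frequency, which is the sense in which it generalizes the Fourier magnitude spectrum to nonstationary signals.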
     The introduction of EMD is a new attempt. The feature extraction methods proposed on this basis have a sound theoretical grounding and good practical effect, and provide a foundation for future research on speech recognition and speaker recognition.
