Research on Speaker Recognition Techniques Based on Several Voiceprint Information Spaces
Abstract
As speaker recognition technology matures, researchers have turned to the practical problems encountered in real applications, proposing solutions that steadily improve system performance and push speaker recognition toward genuine practical use. In both research and application, two questions remain central: how to extract the voiceprint information that characterizes a speaker's identity, and how to exploit that information for recognition. Voiceprint information is suprasegmental; its carriers are distributed throughout all of a speaker's speech data, but different carriers reflect speaker identity to different degrees. This dissertation refers to all the voiceprint information that a given carrier of speaker identity can represent as a voiceprint information space. Speech data therefore contains several voiceprint information spaces that can serve speaker recognition. This dissertation investigates four of them, the phonetic space, the temporal space, the frequency space, and the deep feature space, seeking in each a suitable feature representation and an appropriate modeling method. The main work is as follows:
     1. Speaker recognition based on the phonetic space
     Phonetic segments carry not only textual information but also speaker identity, so they are one carrier of voiceprint information. All the voiceprint information this carrier can represent is called the phonetic space. The goal of this part is to extract and apply speaker-identity information within this space. First, phoneme-level sets of spectral-envelope templates are used to reveal identity differences between speakers. Because a single template set spans the phonetic space incompletely and therefore misses some voiceprint information, multiple template sets are introduced to characterize it. Describing voiceprint information with phonetic template sets resembles a coding process in the phonetic space, so we call the approach a multilingual-coding speaker recognition system. To quantify the speaker information represented by the phonetic templates, the mapping matrix and offset vector estimated under the maximum likelihood linear regression (MLLR) criterion are used. Finally, to exploit the complementary information among template sets from multiple phonetic spaces, several fusion strategies are examined. Experiments show that the proposed method improves system performance in the phonetic space, meeting the stated goal.
     2. Speaker recognition based on the temporal space
     Speech from the same speaker varies greatly in realization across communication environments and across the speaker's own states, yet the voiceprint information that characterizes the speaker is embedded in these realizations produced at different times. The voiceprint information embedded in speech from different time periods is called the temporal space. Conventional speaker recognition systems suffer considerable performance degradation under such variation. Traditional remedies use factor analysis or nuisance attribute projection to remove the harmful effects; this dissertation instead addresses the problem with unsupervised model adaptation, which continually updates the model during training with speech collected at different times, effectively exploiting the voiceprint information distributed over those periods. The dissertation first reviews model-domain unsupervised adaptation, covering the hard-decision and soft-decision update strategies, and then proposes an improved algorithm in the score domain. By defining a prior score distribution and a score confidence, an unsupervised score-normalization algorithm is obtained. This temporal-space technique avoids the large computational cost of model-domain updates while achieving good recognition performance.
     3. Speaker recognition based on the frequency space
     The frequency bands of the speech spectrum are correlated, and this correlation reveals not only textual information but also speaker identity. All the voiceprint information reflected by this carrier is called the frequency space. To expose the voiceprint information in the frequency space and its discriminative power, experiments first show that covariance modeling plays an important role in describing the distribution of voiceprint information. Because covariance matrices are difficult to estimate reliably in real conditions, two stable estimation methods are proposed. Given the covariance estimates, a covariance supermatrix is constructed analogously to the mean supervector, and two distance metrics between supermatrices are proposed to measure the similarity of voiceprint information in the frequency space. Finally, with a suitably designed classifier, the frequency-space system achieves recognition performance comparable to the mainstream mean-supervector system and is partly complementary to it.
     4. Speaker recognition based on the deep feature space
     In conventional recognition systems, both modeling and feature extraction can be explained with shallow structures. This dissertation uses deep neural networks to explore the voiceprint information hidden in spectral data, in what we call the deep feature space, first using the network to simulate human perception of voiceprint information. Training has two parts. The first is an unsupervised feature-expansion step, in which the network maps raw speech data into an abstract, general deep feature representation; however, this representation does not yet cleanly separate speaker information from non-speaker information. The second step, fine-tuning, therefore operates in the deep feature space to extract voiceprint information further. Two constraints are proposed for this purpose, a sparse-coding constraint and a speaker-distance constraint, which fine-tune the network to separate speaker-dependent from speaker-independent information as far as possible. To keep the voiceprint information in the deep feature space free of other interference, experiments are run on the clean TIMIT database. Current results show that the voiceprint information obtained in the deep feature space yields very good recognition performance and is strongly complementary to traditional acoustic features; these results provide solid support for further study of the mechanism of speaker-identity perception.
With the development of speaker recognition technology, researchers have begun to focus on the practical problems that arise in real applications, introducing effective solutions to meet different requirements and to improve recognition performance. How to extract representative voiceprint features and how to build accurate speaker models remain the key research problems. Voiceprint information is suprasegmental: it is spread over the whole of the speech data rather than contained uniformly in any single cue, and its carriers arise in different information spaces depending on how speaker-dependent information is interpreted. In this dissertation, we define a voiceprint information space as all the speaker-dependent information that a given carrier can capture. We explore the phonetic, temporal, frequency, and deep structured feature voiceprint spaces, and in each we focus on obtaining an effective voiceprint representation and an appropriate modeling method.
     Firstly, we build a multilingual-coding speaker recognition system in the phonetic voiceprint space. Phonetic segments contain not only textual information but also speaker-dependent information, making them an effective carrier of voiceprint information; in this part of the work we extract and apply the voiceprint information residing in this space. A set of phonetic patterns is used to reveal speaker-dependent information, and extracting speaker information with phonetic patterns works like a coding process in this unique space. Because a single pattern set covers the phonetic space incompletely, multiple sets of phonetic patterns are introduced to make the coverage more complete. As in the traditional MLLR-SVM system, MLLR transforms represent the voiceprint information extracted from the phonetic patterns for each speech segment, and because the pattern sets are used in parallel, we call the method a multilingual-coding MLLR-SVM speaker recognition system. Several combination strategies are then applied to gather speaker information from the different phonetic voiceprint spaces and improve performance.
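As a toy illustration of the idea, the sketch below fits a one-dimensional affine transform (a, b) by least squares, a simplified stand-in for an MLLR mapping matrix and offset vector, from each of several reference template sets to a speaker's frames, stacks the parameters into a speaker vector, and compares speakers with cosine similarity, fusing scores across template sets by averaging. All template sets, frames, and dimensions are hypothetical, not the thesis's actual configuration.

```python
import math

def fit_affine(xs, ys):
    """Least-squares fit of ys ~ a*xs + b: a 1-D stand-in for the
    MLLR mapping matrix A and offset vector b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx

def speaker_vector(template_sets, frames):
    """Concatenate (a, b) per template set: the 'multilingual coding'
    of one speaker's frames against several phonetic template sets."""
    vec = []
    for templates in template_sets:
        a, b = fit_affine(templates, frames)
        vec.extend([a, b])
    return vec

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) *
                  math.sqrt(sum(x * x for x in v)))

def fused_score(sets, frames1, frames2):
    """Score-level fusion: average the per-template-set similarities."""
    scores = [cosine(speaker_vector([t], frames1),
                     speaker_vector([t], frames2)) for t in sets]
    return sum(scores) / len(scores)

sets = [[1.0, 2.0, 3.0], [0.5, 1.5, 2.5]]   # hypothetical template sets
frames = [2.0, 4.0, 5.0]                     # hypothetical speaker frames
self_score = fused_score(sets, frames, frames)
```

An alternative to this score-level fusion is feature-level fusion, i.e. concatenating the per-set vectors with `speaker_vector(sets, frames)` before a single comparison; the thesis examines several such strategies.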
     Secondly, speaker-dependent information is contained in highly variable speech realizations, including segments from different communication channels and from different personal states. Since these realizations are produced at different times, the voiceprint information they carry is called the temporal voiceprint space. Under such variation, a speaker recognition system can suffer severe performance degradation. Traditionally, researchers apply joint factor analysis (JFA) or nuisance attribute projection (NAP) to this problem; in this dissertation we instead use unsupervised adaptation, which updates the model parameters whenever new training data become available and thereby captures the voiceprint information spread across the temporal space. Complementing the model-domain method with its hard- and soft-decision strategies, we introduce a score-domain unsupervised method: by defining a prior score distribution and a score confidence, we obtain an unsupervised score-normalization algorithm that delivers good performance at a much lower computational cost.
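A minimal sketch of the score-domain idea follows: keep a running Gaussian prior over impostor scores, softly update it with each incoming trial, and emit the z-normalized score. The sigmoid confidence function, its slope, and the pseudo-count are illustrative choices of ours, not the thesis's actual definitions.

```python
import math

class UnsupScoreNorm:
    """Unsupervised score normalization sketch: a running impostor-score
    prior (mean, variance) is updated with a soft-decision confidence
    weight, and each trial score is z-normalized against the prior."""

    def __init__(self, mu=0.0, var=1.0, pseudo_count=10.0):
        self.mu = mu          # prior mean of impostor scores
        self.var = var        # prior variance of impostor scores
        self.n = pseudo_count # strength of the prior

    def confidence_impostor(self, score):
        # Soft decision: high scores look like targets, so they get a
        # low weight when updating the impostor prior.
        z = (score - self.mu) / math.sqrt(self.var)
        return 1.0 / (1.0 + math.exp(2.0 * z))

    def normalize(self, score):
        z = (score - self.mu) / math.sqrt(self.var)
        # Confidence-weighted running update of the prior (soft decision).
        w = self.confidence_impostor(score)
        self.n += w
        delta = score - self.mu
        self.mu += w * delta / self.n
        self.var += w * (delta * (score - self.mu) - self.var) / self.n
        return z

sn = UnsupScoreNorm()
z0 = sn.normalize(0.0)   # a score at the prior mean normalizes to 0
```

A hard-decision variant would replace the sigmoid weight with a 0/1 threshold; the soft weight avoids committing to a possibly wrong target/impostor decision.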
     Thirdly, there are inherent correlations among the frequency bands of the speech spectrum, and these correlations reflect not only textual information but also speaker-dependent information. We say that such information comes from the frequency voiceprint space, and we examine the speaker recognition performance achievable in it. Covariance matrices are introduced to describe the voiceprint information across frequency bands. Because covariance estimation is difficult in practice, we provide two stable estimation methods. Analogous to the traditional mean supervector, we construct a covariance supermatrix to represent the voiceprint information, and two distance metrics are given to measure the similarity of these carriers. Finally, with support vector machines and linear classifiers, we build a frequency-space system that performs on par with traditional mean-supervector systems and is partly complementary to them.
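The thesis's two estimators and two metrics are not specified in this abstract, so the sketch below shows one common stable estimator (shrinkage toward a scaled identity) and one common covariance distance (log-Euclidean), restricted to diagonal covariances for simplicity. The frame data and shrinkage weight are hypothetical.

```python
import math

def band_variances(frames):
    """Per-frequency-band sample variances of a list of frames (each a
    list of band energies): the diagonal of the band covariance, used
    here as a simplified stand-in for the full matrix."""
    n, d = len(frames), len(frames[0])
    means = [sum(f[i] for f in frames) / n for i in range(d)]
    return [sum((f[i] - means[i]) ** 2 for f in frames) / n for i in range(d)]

def shrink(variances, lam):
    """Shrinkage estimator: blend the sample variances with their mean,
    the diagonal analogue of (1-lam)*S + lam*(tr(S)/d)*I, which
    stabilizes the estimate when data are scarce."""
    target = sum(variances) / len(variances)
    return [(1.0 - lam) * v + lam * target for v in variances]

def log_euclidean(v1, v2):
    """Log-Euclidean distance between two diagonal covariances:
    Frobenius norm of the difference of their matrix logarithms."""
    return math.sqrt(sum((math.log(a) - math.log(b)) ** 2
                         for a, b in zip(v1, v2)))

frames = [[0.0, 0.0], [2.0, 4.0]]   # hypothetical band energies
v = band_variances(frames)           # sample variances per band
v_s = shrink(v, 0.5)                 # stabilized estimate
```

Working in the log domain keeps the metric well defined on the positive-definite cone, which is why distances on raw covariance entries are usually avoided.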
     Finally, we explore voiceprint information in the deep structured feature space. In current research, both features and modeling methods can be explained with shallow structures, whereas deep structures, built from several layers of nonlinear nodes, can reveal additional information. We therefore search for voiceprint information in the deep feature space using deep neural networks. Training proceeds in two steps. The first is pretraining, an unsupervised feature-expansion method using deep structures; the expanded features are more general and abstract, but at this stage they cannot separate speaker-dependent from speaker-independent information. The second step, fine-tuning, performs this separation, for which we provide two constraints: a sparse-coding constraint and a speaker-distance constraint. To verify the effectiveness of the voiceprint information in the deep feature space while avoiding interference from other factors, we run experiments on the clean TIMIT database. Preliminary results show that this voiceprint information yields very good recognition performance, and that the proposed system combines with the baseline to give significant further improvement.
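The unsupervised stage plus the sparse-coding constraint can be sketched with a toy one-hidden-layer autoencoder trained on a reconstruction loss with an L1 penalty on the hidden activations (the speaker-distance constraint is omitted here). The network sizes, learning rate, penalty weight, and "frame" data are all illustrative, not the thesis's actual setup.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_autoencoder(data, hid=2, lr=0.1, beta=0.01, epochs=500, seed=0):
    """Gradient descent on 0.5*||x_hat - x||^2 + beta*sum(a):
    reconstruction (the unsupervised feature-expansion stage) plus an
    L1 sparsity penalty on the sigmoid hidden activations a (a stand-in
    for the sparse-coding constraint).  Returns (initial, final) loss."""
    rng = random.Random(seed)
    d = len(data[0])
    W1 = [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(hid)]
    b1 = [0.0] * hid
    W2 = [[rng.uniform(-0.5, 0.5) for _ in range(hid)] for _ in range(d)]
    b2 = [0.0] * d

    def forward(x):
        a = [sigmoid(sum(W1[h][i] * x[i] for i in range(d)) + b1[h])
             for h in range(hid)]
        y = [sum(W2[i][h] * a[h] for h in range(hid)) + b2[i]
             for i in range(d)]
        return a, y

    def loss():
        total = 0.0
        for x in data:
            a, y = forward(x)
            total += 0.5 * sum((y[i] - x[i]) ** 2 for i in range(d))
            total += beta * sum(a)
        return total

    initial = loss()
    for _ in range(epochs):
        for x in data:
            a, y = forward(x)
            dy = [y[i] - x[i] for i in range(d)]                  # dL/dy
            da = [sum(W2[i][h] * dy[i] for i in range(d)) + beta  # dL/da
                  for h in range(hid)]
            dz = [da[h] * a[h] * (1.0 - a[h]) for h in range(hid)]
            for i in range(d):
                for h in range(hid):
                    W2[i][h] -= lr * dy[i] * a[h]
                b2[i] -= lr * dy[i]
            for h in range(hid):
                for i in range(d):
                    W1[h][i] -= lr * dz[h] * x[i]
                b1[h] -= lr * dz[h]
    return initial, loss()

toy_frames = [[0.9, 0.1, 0.8], [0.2, 0.7, 0.1], [0.8, 0.2, 0.9]]
loss_before, loss_after = train_autoencoder(toy_frames)
```

In the thesis's second stage, fine-tuning would add a speaker-distance term to this objective so that the hidden representation separates speaker-dependent from speaker-independent information; stacking several such layers yields the deep structure.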
