Text-Independent Embedded Voiceprint Recognition Access Control System

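English title: Text-independent Embedded Voice Recognition Door Manager System
Author: Yang Jiadong (杨佳东)
Degree level: Master's
Discipline: Computer Architecture
Degree year: 2004
Advisor: Yu Zhezhou (于哲舟)
Discipline code: 081201
Degree-granting institution: Jilin University
Thesis submission date: 2004-04-01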

Abstract

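In recent years, biometric recognition has been applied more and more widely to identity verification because of its strong security. Biometric recognition verifies identity from a person's own physiological or behavioral characteristics, which cannot be duplicated. Human biometric characteristics include fingerprints, voice, face, retina, iris, hand geometry, palm veins, skeletal structure, and so on. The core of biometric recognition is how to acquire these characteristics and how to reach a decision against a standard.
    Speech is the basic means by which people exchange information, and the speech signal is an inherent characteristic of the individual. With the rapid development of information science and technology, speech processing has made breakthrough progress over the past 20 years.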
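    Speech signal processing has several major branches: speech recognition, speech synthesis, speech coding, and others. Speech recognition covers two directions: voiceprint (speaker) recognition and speech content recognition. Voiceprint recognition analyzes the individual characteristics of a speaker's voice, so its result is the identity of the speaker; it emphasizes the differences in the speech signal itself between different people. Voiceprint recognition is further divided into speaker identification and speaker verification. The former determines which speaker in the speech database a test utterance belongs to. The latter determines whether a test utterance does or does not belong to one specific speaker, so its output is a binary decision (accept or reject).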
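The difference between the two tasks can be sketched in a few lines of C. This is only an illustration: the function names and the idea that each enrolled speaker's model yields one numeric score (higher meaning a better match) are assumptions, not details taken from the thesis.

```c
#include <stddef.h>

/* Speaker identification (closed set): return the index of the enrolled
 * speaker whose model scores the test utterance highest, or -1 if no
 * speaker is enrolled. scores[i] is a hypothetical match score for
 * speaker i; higher means a better match. */
int identify_speaker(const double *scores, size_t n_speakers)
{
    if (n_speakers == 0)
        return -1;
    size_t best = 0;
    for (size_t i = 1; i < n_speakers; i++)
        if (scores[i] > scores[best])
            best = i;
    return (int)best;
}

/* Speaker verification: binary decision for one claimed identity.
 * Returns 1 (accept) if the claimed speaker's score reaches the
 * threshold, otherwise 0 (reject). */
int verify_claim(double claimed_score, double threshold)
{
    return claimed_score >= threshold;
}
```

Identification needs a score for every enrolled speaker, while verification needs only the score of the claimed speaker's model, which is one reason verification suits a door-access setting.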
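    Speech production is closely tied to the movement of the vocal organs. This physical movement is much slower than the frequencies of the speech signal, so the speech signal can usually be assumed to be stationary over very short intervals. The various speech recognition algorithms are all built on this assumption.
    Building on a study of traditional voiceprint recognition methods, this thesis develops an embedded, text-independent voiceprint access control system. The system is written in C for better portability. Given the characteristics of an access control system, it uses the speaker verification form of voiceprint recognition. Its main processing flow consists of preprocessing, acoustic parameter analysis and feature extraction, template formation, measure estimation, and decision.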
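    Preprocessing includes sampling and quantization, pre-emphasis filtering, windowing and framing, computation of time-domain and frequency-domain parameters, and endpoint detection. The system uses double buffering to capture speech samples in real time at 8000 samples per second. The sampled signal is then passed through a first-order high-pass filter, $1 - 0.9375z^{-1}$, that is, the pre-emphasis filter. Its purpose is to filter out low-frequency interference; boosting the high-frequency part of the spectrum also helps remove DC drift and suppress random noise. The sample sequence is then windowed and split into frames of 240 samples with a frame shift of 80 samples (30 ms frames starting every 10 ms at this sampling rate). The zero-crossing rate and short-time energy are computed for each frame, and endpoint detection is performed against empirical thresholds.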
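The preprocessing chain can be sketched as follows. This is a minimal sketch rather than the thesis code. The filter coefficient 0.9375, the 240-sample frame and the 80-sample shift come from the abstract. The function names, the rule that a frame counts as speech when either measure exceeds its threshold, and the omission of a window function (the abstract does not name the window used) are assumptions made for illustration.

```c
#include <stddef.h>

#define FRAME_LEN   240   /* samples per frame (30 ms at 8 kHz) */
#define FRAME_SHIFT 80    /* samples between frame starts (10 ms) */

/* Pre-emphasis y[n] = x[n] - 0.9375 * x[n-1]; x[-1] is taken as 0. */
void pre_emphasize(const short *x, double *y, size_t n)
{
    double prev = 0.0;
    for (size_t i = 0; i < n; i++) {
        y[i] = (double)x[i] - 0.9375 * prev;
        prev = (double)x[i];
    }
}

/* Number of complete frames that fit in n samples. */
size_t frame_count(size_t n)
{
    return n < FRAME_LEN ? 0 : (n - FRAME_LEN) / FRAME_SHIFT + 1;
}

/* Short-time energy: sum of squared samples in one frame. */
double frame_energy(const double *frame)
{
    double e = 0.0;
    for (size_t i = 0; i < FRAME_LEN; i++)
        e += frame[i] * frame[i];
    return e;
}

/* Zero-crossing count: sign changes between neighbouring samples. */
int frame_zero_crossings(const double *frame)
{
    int z = 0;
    for (size_t i = 1; i < FRAME_LEN; i++)
        if ((frame[i - 1] >= 0.0) != (frame[i] >= 0.0))
            z++;
    return z;
}

/* Flag each frame as speech (1) or silence (0). The either-measure rule
 * and both thresholds are hypothetical stand-ins for the empirical
 * thresholds mentioned in the abstract. Returns the number of frames. */
size_t detect_speech(const double *y, size_t n, double e_thr, int z_thr,
                     unsigned char *is_speech)
{
    size_t frames = frame_count(n);
    for (size_t k = 0; k < frames; k++) {
        const double *f = y + k * FRAME_SHIFT;
        is_speech[k] = frame_energy(f) > e_thr ||
                       frame_zero_crossings(f) > z_thr;
    }
    return frames;
}
```

A caller would run `pre_emphasize` on a captured buffer, then `detect_speech` to locate the first and last speech frames; the `is_speech` array must have room for `frame_count(n)` entries.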
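    Most current speech recognition methods use LPCC or MFCC parameters as features. Earlier research indicates that, for voiceprint recognition, MFCC parameters carry more of the speaker's individual characteristics, so this system also uses MFCC parameters as features. Experiments further showed that the first-order difference (delta) MFCC parameters mainly reflect the relationship between neighboring speech frames in the text-dependent case. The system therefore does not use difference parameters and extracts only a 16-dimensional MFCC vector per frame, yielding the parameter sequence.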
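For reference, one common formulation of MFCC extraction is given below. It is an assumption: the abstract does not state the number of mel filters $M$, the mel mapping, or the coefficient index range used by the system. The frequency $f$ in Hz is mapped to the mel scale, the frame's power spectrum $|X(k)|^2$ is summed under $M$ triangular mel-spaced filters $H_m(k)$, and a discrete cosine transform is applied to the log energies:

```latex
\mathrm{mel}(f) = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right)

E_m = \sum_{k} |X(k)|^2 \, H_m(k), \qquad m = 1, \dots, M

c_n = \sum_{m=1}^{M} \ln(E_m) \cos\!\left(\frac{\pi n (m - 0.5)}{M}\right), \qquad n = 1, \dots, 16
```

Under this formulation the 16 coefficients $c_1, \dots, c_{16}$ would form the per-frame feature vector. At a sampling rate of 8000 samples per second the filters can cover frequencies only up to 4 kHz.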
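    The system trains each speaker's template as a hidden Markov model (HMM). The application of HMMs is the most important achievement in speech recognition since the 1980s. On the one hand, an HMM uses hidden states to correspond to the relatively stable pronunciation units at the acoustic level and describes changes in pronunciation through state transitions and state dwelling. On the other hand, it introduces a probabilistic statistical model: probability density functions give the output probability of the speech parameters for the model, and the recognition result is found by searching for the best state sequence under the maximum a posteriori criterion. The HMM therefore expresses the acoustic model of speech fairly completely, and its statistical training merges the low-level acoustic model and the high-level language model into a unified speech recognition search algorithm, which gives good results.
    HMM training proceeds as follows: first set initial values for the state parameters; then use the Viterbi algorithm to compute the output probability of the input speech parameter sequence; re-estimate the state parameters from this output probability with the Baum-Welch algorithm; then check whether the model has converged, and repeat if it has not. This system improves on the traditional Baum-Welch training algorithm: it first performs state segmentation and dynamic clustering on the speech feature vectors, then uses fuzzy statistics to find the B parameters, and then carries out the iterative re-estimation. Compared with B parameters set by experience and repeated comparison, this approach is better founded and more effective. It reduces the number of iterations and, more importantly, to some extent avoids model divergence and convergence of the parameters to a local optimum, and it lowers the computation and storage required.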
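    After HMM training is complete, the Viterbi algorithm computes the output probability of the test utterance with respect to the converged template. The system uses this probability as the distance measure and makes its decision from the Euclidean distance between this probability and a preset threshold.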
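The scoring step can be sketched as a log-domain Viterbi pass. This is an illustrative sketch under several assumptions that are not stated in the abstract: a three-state model (`N_STATES`), emission log-probabilities already evaluated per frame and state (`log_b`), and a decision that simply compares the per-frame average log score with a threshold instead of the distance rule described above.

```c
#include <math.h>    /* INFINITY */
#include <stddef.h>

#define N_STATES 3   /* hypothetical state count; not given in the abstract */

/* Log-probability of the single best state path (Viterbi score).
 *   log_pi[i]                : log initial probability of state i
 *   log_a[i * N_STATES + j]  : log transition probability i -> j
 *   log_b[t * N_STATES + j]  : log emission probability of frame t in state j
 * Impossible events are represented by -INFINITY. */
double viterbi_log_score(const double *log_pi, const double *log_a,
                         const double *log_b, size_t n_frames)
{
    double delta[N_STATES], next[N_STATES];
    if (n_frames == 0)
        return -INFINITY;
    for (int j = 0; j < N_STATES; j++)
        delta[j] = log_pi[j] + log_b[j];
    for (size_t t = 1; t < n_frames; t++) {
        for (int j = 0; j < N_STATES; j++) {
            double best = -INFINITY;
            for (int i = 0; i < N_STATES; i++) {
                double cand = delta[i] + log_a[i * N_STATES + j];
                if (cand > best)
                    best = cand;
            }
            next[j] = best + log_b[t * N_STATES + j];
        }
        for (int j = 0; j < N_STATES; j++)
            delta[j] = next[j];
    }
    double score = delta[0];
    for (int j = 1; j < N_STATES; j++)
        if (delta[j] > score)
            score = delta[j];
    return score;
}

/* Accept (1) if the average log score per frame reaches the threshold.
 * Dividing by the frame count is an assumption added here so that long
 * and short utterances are comparable. */
int verify_speaker(double log_score, size_t n_frames, double threshold)
{
    return n_frames > 0 && log_score / (double)n_frames >= threshold;
}
```

Working in the log domain replaces products of small probabilities with sums, which avoids numerical underflow on long utterances, a practical concern on embedded hardware.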
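    The decision stage uses a multi-method, multi-threshold association technique. "Multi-method" means sequential recognition with several methods: the system chains three decisions in series, based on the average zero-crossing rate and the average short-time energy obtained during endpoint detection, and on the HMM. This improves system efficiency and lowers the false acceptance rate compared with using any one method alone, but it also raises the false rejection rate additively to some extent. This conflict between false acceptance and false rejection is hard to resolve, so the system also adopts multi-threshold association. Every method needs a decision threshold, so the system automatically computes a high and a low threshold from several input speech sample sequences. Which threshold a later method uses depends on the decision result of the method before it; in other words, the more similar the test utterance is to the template under one method, the lower the threshold required by the next method. The experimental results indicate that this technique partly resolves the conflict between false acceptance and false rejection and brings both rates to a satisfactory level.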
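One way to read the multi-threshold rule is sketched below. The abstract states only that each method has a high and a low threshold and that a closer match at one stage lowers the threshold required at the next. Everything else here is an assumption for illustration: each stage is reduced to one similarity score (higher meaning more similar), the first stage is checked against its low threshold, and "close match" is taken to mean reaching the stage's high threshold.

```c
#define N_STAGES 3   /* zero-crossing rate, short-time energy, HMM */

typedef struct {
    double low;    /* lenient threshold, used after a strong previous match */
    double high;   /* strict threshold, used after a weak previous match */
} stage_thresholds;

/* Run the stages in series. Returns 1 (accept) only if every stage
 * passes, 0 (reject) as soon as one stage fails. */
int cascade_verify(const double score[N_STAGES],
                   const stage_thresholds thr[N_STAGES])
{
    int prev_strong = 1;   /* assumption: stage 0 uses its low threshold */
    for (int s = 0; s < N_STAGES; s++) {
        double required = prev_strong ? thr[s].low : thr[s].high;
        if (score[s] < required)
            return 0;                       /* early reject */
        prev_strong = score[s] >= thr[s].high;
    }
    return 1;
}
```

Because a failing stage rejects immediately, the inexpensive zero-crossing and energy checks can screen out many impostors before the costlier HMM scoring runs, which is consistent with the efficiency gain claimed for sequential recognition.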
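    Although the system achieved good experimental results, it is still far from practical use and is incomplete in many respects:
    Experiments show that the quality of the template has a decisive effect on the recognition result. As for what makes a template "bad": the ten training utterances recorded by the speaker must fully reflect how the speaker talks under the most normal conditions; if they do not, …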

English abstract

In recent years, biometric recognition has been used more and more in personal identification because of its strong security. Biometric recognition validates identity using a person's own physiological and behavioral characteristics, which cannot be copied. Human biometric characteristics include fingerprints, voice, face, retina, iris, hand geometry, palm veins, skeletal structure, and so on. The core of biometric recognition is how to acquire these characteristics and how to reach a decision against a standard.
    Speech is the basic means of human communication, and the speech signal is an inherent characteristic of the individual. Speech processing has made great progress along with the advance of information science and technology.
    Speech signal processing has several branches: speech recognition, speech synthesis, speech coding, and so on. Speech recognition has two directions: speaker recognition (SR) and speech content recognition. SR analyzes the individual characteristics of the speaker's voice and then determines who the speaker is. What it emphasizes is the difference in the speech signal itself between different people. Speaker recognition in turn has two forms: speaker identification (SI) and speaker verification (SV). The former decides whose voice in the speech database a test sample is; the latter decides whether a test sample is or is not the voice of one specific speaker. Its output has only two possible results (the speaker is accepted or rejected).
    Speech production is closely related to the movement of the vocal organs. This physical movement is slow compared with the frequencies of speech, so speech signals can usually be assumed to be stationary over a very short time. The various speech recognition algorithms are all based on this assumption.
    Based on research into traditional speaker recognition techniques, this thesis develops an embedded, text-independent voiceprint access control system. The system is written in C, which makes it more portable. Given the characteristics of an access control system, it adopts the speaker verification form of speaker recognition. The main flow of the system includes preprocessing, acoustic parameter analysis and feature extraction, template formation, measure estimation, and decision.
    Preprocessing includes sampling and quantization, pre-emphasis filtering, windowing and framing, calculation of time-domain and frequency-domain parameters, and endpoint detection. The system uses double buffering to capture speech samples in real time at 8000 samples per second. The samples then pass through a first-order high-pass filter, $1 - 0.9375z^{-1}$, namely the pre-emphasis filter. Its purposes are to remove low-frequency interference, boost the high-frequency part of the spectrum, remove DC drift, and suppress random noise. The sample sequence is windowed and divided into frames of 240 samples each, with a frame shift of 80 samples. The zero-crossing rate and short-time energy are then calculated for every frame, and endpoint detection is carried out against empirical thresholds.
    At present, LPCC and MFCC parameters are the features most commonly used in speech recognition methods. For speaker recognition, MFCC parameters carry more of the speaker's individual characteristics, so this system adopts MFCC parameters as its features. According to the experimental results, the first-order difference parameters of MFCC mainly reflect the characteristics of neighboring speech frames in the text-dependent case. The system therefore does not use the first-order difference parameters; it extracts only a 16-dimensional MFCC vector for every frame to obtain the parameter sequence.
    This system uses a Hidden Markov Model (HMM) to train the template for each speaker. The application of HMM is the most important achievement that the speech recognition field has ma

