Research on Microphone-Array-Based Speech Enhancement and Recognition
Abstract
Automatic speech recognition (ASR) already achieves high recognition accuracy on clean speech. In practical environments, however, ambient noise and reverberation, together with interference from other sound sources, cause a mismatch between the speech features to be recognized and the training templates, and recognition performance degrades sharply. This thesis addresses ASR systems whose front end is a small microphone array and studies several wideband array processing methods for speech. The goal is to increase, through joint spatial-temporal processing, the probability that speech is recognized correctly in practical environments.
     For speech source localization, a wideband direction-of-arrival (DOA) estimation method based on the ESPRIT (Estimation of Signal Parameters via Rotational Invariance Techniques) algorithm is adopted and then improved by combining it with multichannel linear prediction analysis of speech and SNR estimation. Experiments on signals received by a small microphone array show that this high-resolution wideband method performs far better than conventional beamforming, while avoiding the scan over the entire angular domain that other typical high-resolution methods require. The localization results then guide the subsequent array processing, which extracts the signal arriving from the direction of a specified speaker.
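The abstract does not give the details of the wideband, linear-prediction-aided variant, so the following is only a minimal sketch of standard narrowband ESPRIT for a uniform linear array, not the method developed in the thesis. It illustrates why no angular scan is needed: the angles come directly from the eigenvalues of a small matrix relating two shifted subarrays. The function names `esprit_doa` and `simulate_ula`, the half-wavelength spacing, and the simulated data are all assumptions made for illustration.

```python
import numpy as np


def complex_gaussian(rng, shape):
    """Unit-variance circular complex Gaussian samples."""
    return (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)


def simulate_ula(angles_rad, num_sensors=8, num_snapshots=2000,
                 snr_db=20.0, spacing_wavelengths=0.5, seed=0):
    """Simulate narrowband snapshots (sensors x time) for uncorrelated far-field sources."""
    rng = np.random.default_rng(seed)
    angles_rad = np.asarray(angles_rad, dtype=float)
    sensor_idx = np.arange(num_sensors)[:, None]
    # Steering matrix: phase grows linearly along the array for each source.
    steering = np.exp(2j * np.pi * spacing_wavelengths
                      * sensor_idx * np.sin(angles_rad)[None, :])
    sources = complex_gaussian(rng, (angles_rad.size, num_snapshots))
    noise = 10 ** (-snr_db / 20) * complex_gaussian(rng, (num_sensors, num_snapshots))
    return steering @ sources + noise


def esprit_doa(snapshots, num_sources, spacing_wavelengths=0.5):
    """Estimate DOAs in radians from broadside, sorted ascending."""
    num_snapshots = snapshots.shape[1]
    # Sample spatial covariance matrix.
    covariance = snapshots @ snapshots.conj().T / num_snapshots
    # eigh returns eigenvalues in ascending order, so the signal subspace
    # is spanned by the last num_sources eigenvectors.
    _, eigvecs = np.linalg.eigh(covariance)
    signal_subspace = eigvecs[:, -num_sources:]
    # Two overlapping subarrays displaced by one sensor spacing.
    upper = signal_subspace[:-1, :]
    lower = signal_subspace[1:, :]
    # Least-squares solution of upper @ psi = lower (rotational invariance).
    psi = np.linalg.lstsq(upper, lower, rcond=None)[0]
    # Eigenvalue phases equal 2*pi*d*sin(theta) for each source.
    phases = np.angle(np.linalg.eigvals(psi))
    sin_theta = phases / (2 * np.pi * spacing_wavelengths)
    return np.sort(np.arcsin(np.clip(sin_theta, -1.0, 1.0)))
```

A call such as `esprit_doa(simulate_ula(np.deg2rad([-20.0, 35.0])), num_sources=2)` returns the two estimated angles in radians. A wideband speech signal would additionally require some per-frequency or focusing scheme, which this sketch leaves out.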
     Most existing microphone-array ASR systems comprise two independent, sequential stages: array signal processing and feature recognition. This thesis treats the two stages jointly: the output of the recognizer is fed back to the microphone-array front end, and the array filter coefficients are adjusted during recognition by an optimization procedure that maximizes the likelihood of the correct transcription for a selected vocabulary. A global search algorithm is also applied in the coefficient adjustment to further improve the joint optimization scheme. Unlike conventional array processing, which aims to enhance waveform quality, this approach enhances the speech features so that they better match the recognition model, directly increasing the likelihood of correct hypotheses during recognition. Experiments show that training the filter coefficients with the joint optimization scheme clearly improves recognition performance.
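The idea of tuning array coefficients against the recognizer's likelihood, rather than against waveform quality, can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the channels are taken to be already time-aligned toward the speaker, the "recognizer" is reduced to one unit-variance Gaussian per frame over a single log-energy feature (standing in for the aligned state models of the correct transcription), each channel gets one real weight instead of a full filter, and simulated annealing stands in for the global search algorithm, which the abstract does not name. The helper names are hypothetical.

```python
import numpy as np


def log_energy_features(signal, frame_len=160):
    """Per-frame log energy, a one-dimensional stand-in for cepstral features."""
    num_frames = len(signal) // frame_len
    frames = signal[: num_frames * frame_len].reshape(num_frames, frame_len)
    return np.log(np.mean(frames ** 2, axis=1) + 1e-10)


def log_likelihood(weights, channels, state_means, state_var=1.0):
    """Gaussian log-likelihood (up to a constant) of the beamformed features."""
    beamformed = weights @ channels  # weighted sum over aligned channels
    features = log_energy_features(beamformed)
    return -0.5 * np.sum((features - state_means) ** 2) / state_var


def make_toy_channels(num_samples=16000, seed=1):
    """Amplitude-modulated noise as 'speech', observed by one clean and three noisy channels."""
    rng = np.random.default_rng(seed)
    t = np.arange(num_samples) / num_samples
    clean = (0.2 + np.abs(np.sin(2 * np.pi * 3 * t))) * rng.standard_normal(num_samples)
    noise_levels = np.array([0.1, 1.0, 1.0, 1.0])
    noise = noise_levels[:, None] * rng.standard_normal((4, num_samples))
    return clean[None, :] + noise, clean


def anneal_weights(channels, state_means, num_iters=400, step=0.2, seed=0):
    """Simulated-annealing search over channel weights; returns the best weights seen."""
    rng = np.random.default_rng(seed)
    num_channels = channels.shape[0]
    # Start from equal weights, i.e. a plain sum of the aligned channels.
    current = np.full(num_channels, 1.0 / num_channels)
    current_ll = log_likelihood(current, channels, state_means)
    best, best_ll = current.copy(), current_ll
    for it in range(num_iters):
        temperature = 1.0 - it / num_iters  # linear cooling, stays positive
        candidate = current + step * rng.standard_normal(num_channels)
        candidate_ll = log_likelihood(candidate, channels, state_means)
        # Metropolis rule: always accept improvements, sometimes accept worse moves.
        accept = candidate_ll > current_ll or (
            rng.random() < np.exp((candidate_ll - current_ll) / temperature))
        if accept:
            current, current_ll = candidate, candidate_ll
            if current_ll > best_ll:
                best, best_ll = current.copy(), current_ll
    return best, best_ll
```

To try it, build the data with `channels, clean = make_toy_channels()`, take `state_means = log_energy_features(clean)` as the model, and compare `log_likelihood` at equal weights with the value returned by `anneal_weights(channels, state_means)`. Because the search keeps the best point it has visited and starts from equal weights, its result is never worse than that baseline under this toy model.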
