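Research on Pruning Algorithm for Speech Recognition

Original title (Chinese): 语音识别剪枝算法研究
Author: Lu Yao (陆遥)
Degree level: Master's
Discipline: Information and Signal Processing
Keywords (translated from Chinese): local service search; recognition network; pruning; look-ahead tree
Keywords (English): voice-based local search; recognition network; pruning; language model look-ahead tree
Degree year: 2012
Advisor: Liu Gang (刘刚)
Discipline code: 081203
Degree-granting institution: Beijing University of Posts and Telecommunications (北京邮电大学)
Submission date: 2012-02-20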

Abstract

Speech is the most effective means of interaction between people. Replacing traditional keyboard and touch-screen input with speech input makes devices more convenient to operate, and applying speech technology to human-computer interaction has long been a frontier research topic. With the rapid development of mobile terminals and the mobile Internet, speech recognition on mobile terminals has become a new research hotspot.
     Compared with relatively mature speech recognition technology, recognition on mobile terminals faces more technical challenges, chiefly because the computing capability of a mobile terminal is limited. First, recognizers built on traditional algorithms cannot meet the real-time requirements of speech input. Second, high-accuracy speech recognition needs a language model with strong descriptive power, usually a statistical n-gram model, whose large storage requirement makes it difficult to deploy fully on a mobile terminal. Current research on mobile-terminal speech recognition therefore concentrates on two problems: reducing the recognizer's memory footprint and speeding up decoding.
     To address these problems, this thesis proposes a pruning algorithm driven by the user's location information, which effectively improves recognizer performance. The main work is as follows:
     1. Pruning algorithm for finite-state grammar recognition systems
     Isolated-word and fixed-sentence recognition systems are the foundation of speech recognition, and their recognition networks are simpler than those of large-vocabulary continuous speech recognition systems. The thesis first studies pruning on these simple networks, proposes an algorithm that prunes the recognition network according to the user's location information, and demonstrates its effectiveness through a large number of experiments.
     2. Pruning algorithm for continuous speech recognition systems
     Building on the finite-state network pruning work, the thesis proposes an algorithm that uses the language model look-ahead tree structure to prune during decoding, so that the decoder's search space is pruned according to the user's location information. Comparative experiments show that the algorithm improves both the recognition accuracy and the speed of the recognizer.
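The abstract gives no implementation details, so the following is only an illustrative sketch of how location-based pruning could be combined with a language model look-ahead tree. It assumes a lexical prefix tree over syllable units, unigram probabilities as the look-ahead score, and a per-word set of location tags. The names `Node`, `build_tree`, `prune_by_location`, and the toy lexicon are all hypothetical and are not taken from the thesis.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set, Tuple


@dataclass
class Node:
    """One node of a lexical prefix tree (hypothetical structure)."""
    children: Dict[str, "Node"] = field(default_factory=dict)
    word: Optional[str] = None   # set when a word ends at this node
    lookahead: float = 0.0       # best LM probability reachable at or below this node


def build_tree(lexicon: Dict[str, Tuple[str, ...]]) -> Node:
    """Build a prefix tree from word -> sequence of sub-word units.

    Simplification: homophones (identical unit sequences) overwrite each other.
    """
    root = Node()
    for word, units in lexicon.items():
        node = root
        for unit in units:
            node = node.children.setdefault(unit, Node())
        node.word = word
    return root


def prune_by_location(
    node: Node,
    unigram: Dict[str, float],
    word_locations: Dict[str, Set[str]],
    user_location: str,
) -> float:
    """Remove words not tagged with user_location and recompute look-ahead scores.

    Returns the look-ahead score of `node`. A score of 0.0 marks a subtree with
    no surviving word, which the caller deletes (assumes probabilities > 0).
    """
    best = 0.0
    if node.word is not None:
        if user_location in word_locations.get(node.word, set()):
            best = unigram.get(node.word, 0.0)
        else:
            node.word = None  # word is irrelevant at this location
    for unit in list(node.children):
        score = prune_by_location(
            node.children[unit], unigram, word_locations, user_location
        )
        if score == 0.0:
            del node.children[unit]  # nothing reachable: prune the branch
        else:
            best = max(best, score)
    node.lookahead = best
    return best


# Toy data (hypothetical): place names split into syllables.
lexicon = {
    "xidan": ("xi", "dan"),
    "xihu": ("xi", "hu"),
    "wangfujing": ("wang", "fu", "jing"),
}
unigram = {"xidan": 0.3, "xihu": 0.5, "wangfujing": 0.2}
word_locations = {
    "xidan": {"beijing"},
    "xihu": {"hangzhou"},
    "wangfujing": {"beijing"},
}

root = build_tree(lexicon)
prune_by_location(root, unigram, word_locations, "beijing")
print(sorted(root.children), root.lookahead)
```

In this sketch a decoder would consult `node.lookahead` when extending a hypothesis into a node, so branches that lead only to out-of-location words never enter the beam. Whether the thesis prunes the network offline, as here, or applies the location constraint dynamically during decoding is not stated in the abstract.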
