Research on Key Technologies of Embedded Human-Machine Speech Interaction Systems
Abstract
Speech is the most natural and convenient way for humans to communicate, and one of the most direct modes of human-machine interaction; it is widely regarded as the protagonist of the next revolution in human-machine interfaces. With the spread of embedded mobile devices such as smartphones and tablets, and with core speech technologies and their application environments gradually maturing, voice interaction is being accepted and used by a growing number of users worldwide. However, the power and computing constraints of embedded mobile devices, together with the complexity of real usage environments, mean that practical embedded speech interaction systems still face many problems and challenges. Against this background, this dissertation presents a systematic, in-depth study of the key common technologies of embedded human-machine speech interaction systems, making innovative contributions in the following three areas.
     First, to address the noise robustness of the recognition front end, a model compensation algorithm is proposed that accounts jointly for additive noise and channel distortion. The additive noise is estimated from the non-speech segments of an utterance, the channel function is then estimated with the EM algorithm, and the mismatched acoustic model is compensated jointly in the cepstral domain. The algorithm delivers significant recognition gains under both noisy and channel-mismatched conditions, can track environmental changes dynamically, and outperforms several traditional noise-robustness algorithms for speech recognition.
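The joint compensation can be pictured with a standard log-add (PMC/VTS-style) update of the acoustic-model means in the cepstral domain. The sketch below is a minimal illustration under that reading, not the dissertation's exact algorithm: the EM channel estimation is reduced to a fixed cepstral bias, and the helper names, the orthonormal DCT convention, and the VAD-label layout are all assumptions.

```python
import numpy as np
from scipy.fftpack import dct, idct

def estimate_noise(frames_cep, is_speech):
    """Additive-noise cepstral mean taken from frames a VAD marked as
    non-speech, mirroring the abstract's use of non-speech segments."""
    return frames_cep[~is_speech].mean(axis=0)

def compensate_means(clean_cep_means, noise_cep_mean, channel_cep=0.0):
    """Zeroth-order (log-add) compensation of cepstral Gaussian means:
    map to log filterbank energies, combine the channel-shifted speech
    and the noise in the linear power domain, and map back."""
    log_clean = idct(clean_cep_means + channel_cep, axis=-1, norm='ortho')
    log_noise = idct(noise_cep_mean, norm='ortho')
    log_noisy = np.log(np.exp(log_clean) + np.exp(log_noise))
    return dct(log_noisy, axis=-1, norm='ortho')
```

Re-estimating the noise mean on every utterance is what would let compensation of this kind track a changing environment, as the abstract claims.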
     Second, to meet the need for medium-vocabulary continuous speech recognition on embedded devices with limited computing resources, a decoding algorithm based on a language-model correction mechanism is proposed for the recognition decoder. It replaces the traditional tree-copy lexicon search, whose search space grows exponentially with the size of the lexicon, with a search over a single lexicon tree, and recovers the search errors that the single tree introduces by applying language-model correction at each node of the tree; this reduces the complexity of the decoder by an order of magnitude without degrading recognition performance. Next, for the confidence module of the recognition back end, a confidence decision algorithm based on phone-clustered subspaces is proposed: KL-divergence-based phone clustering yields a more compact phone subspace, from which the normalization term of the confidence score can be estimated more accurately, cutting the computational cost substantially with essentially no loss in confidence performance.
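The per-node language-model correction is reminiscent of LM look-ahead in a prefix-tree lexicon. The sketch below only illustrates that general idea; the `Node` class and the unigram `lm_prob` table are illustrative stand-ins, and the dissertation's actual correction may differ.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    children: list = field(default_factory=list)
    word: Optional[str] = None   # set at word-end (leaf) nodes
    la: float = 0.0              # look-ahead / correction factor

def attach_lm_correction(node: Node, lm_prob: dict) -> float:
    """Store at each node the best LM probability of any word reachable
    below it.  During single-tree decoding the partial path score is
    corrected by node.la, and the correction is replaced by the exact
    LM score once a word end is reached, repairing the search errors a
    shared tree would otherwise introduce."""
    if node.word is not None:
        node.la = lm_prob.get(node.word, 1e-10)
    else:
        node.la = max(attach_lm_correction(c, lm_prob) for c in node.children)
    return node.la
```

For the confidence module, one reading of KL-based phone clustering is a greedy agglomerative merge of phone models under a symmetrised KL distance, with one representative per cluster replacing the full phone set in the normalization term. A toy single-Gaussian version, again an assumption rather than the dissertation's procedure:

```python
import numpy as np

def sym_kl(m1, v1, m2, v2):
    """Symmetrised KL divergence between diagonal-covariance Gaussians."""
    def kl(ma, va, mb, vb):
        return 0.5 * np.sum(np.log(vb / va) + (va + (ma - mb) ** 2) / vb - 1.0)
    return kl(m1, v1, m2, v2) + kl(m2, v2, m1, v1)

def cluster_phones(means, variances, n_clusters):
    """Greedily merge the closest pair of phone Gaussians (with moment
    matching) until n_clusters remain; scoring the normalization term
    over cluster representatives instead of all phones is what cuts
    the cost of the confidence measure."""
    cm = [np.asarray(m, float) for m in means]
    cv = [np.asarray(v, float) for v in variances]
    clusters = [[i] for i in range(len(cm))]
    while len(clusters) > n_clusters:
        i, j = min(((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
                   key=lambda p: sym_kl(cm[p[0]], cv[p[0]], cm[p[1]], cv[p[1]]))
        w = len(clusters[i]) / (len(clusters[i]) + len(clusters[j]))
        m = w * cm[i] + (1 - w) * cm[j]
        cv[i] = w * (cv[i] + cm[i] ** 2) + (1 - w) * (cv[j] + cm[j] ** 2) - m ** 2
        cm[i] = m
        clusters[i] += clusters[j]
        del clusters[j], cm[j], cv[j]
    return clusters
```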
     Finally, to meet the typical need of voice queries over text lists with more than ten million entries, a complete system solution for fuzzy voice retrieval is proposed. By combining a two-level inverted index, block-wise dynamic programming, and recognition re-ranking, it lets users retrieve the associated candidates by speaking no more than a fragment of a list item, an abbreviation of it, or an out-of-order combination of the two. While supporting free-form voice input, the system achieves high retrieval performance and markedly improves the speech interaction experience.
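The retrieval pipeline (inverted index for candidate generation, dynamic programming for flexible matching, then re-ranking) can be miniaturised as below. This is a deliberately simplified single-level index with a plain longest-common-subsequence scorer, assumptions standing in for the dissertation's two-level index and block-wise DP:

```python
from collections import defaultdict

def build_index(items):
    """Character -> item-id inverted index for candidate generation."""
    index = defaultdict(set)
    for i, item in enumerate(items):
        for ch in set(item):
            index[ch].add(i)
    return index

def subseq_score(query, item):
    """Longest-common-subsequence DP, normalised by query length, so a
    fragment or abbreviation of an item still scores highly."""
    m, n = len(query), len(item)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = dp[i][j] + 1 if query[i] == item[j] else \
                max(dp[i][j + 1], dp[i + 1][j])
    return dp[m][n] / m if m else 0.0

def retrieve(query, items, index, topk=5):
    """Gather candidates from the index, then re-rank by the DP score."""
    cand = set().union(*(index.get(ch, set()) for ch in set(query)))
    ranked = sorted(cand, key=lambda i: -subseq_score(query, items[i]))
    return [items[i] for i in ranked[:topk]]
```

For example, `retrieve('中科大', items, build_index(items))` would surface 中国科学技术大学, since the abbreviation is a subsequence of the full item. Out-of-order combinations need a more permissive scorer, which is presumably where the dissertation's block-wise dynamic programming comes in.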
