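Towards High Performance and Parsimonious Acoustic Modeling in Speech Recognition

Original title: 高效简约的语音识别声学模型
Author: 李小兵 (Li Xiaobing)
Thesis level: Doctoral
Discipline: Signal and Information Processing
Degree year: 2006
Advisors: 王仁华 (Wang Renhua); 宋謌平 (Frank K. Soong)
Discipline code: 081002
Degree-granting institution: University of Science and Technology of China
Submission date: 2006-04-01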

Abstract

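Current speech recognition systems based on continuous density HMMs (CDHMMs) perform well, but their storage and computation requirements are too large. To address this problem, this thesis focuses on the core of a speech recognition system, the acoustic model. We study how to obtain a high-performance, compact acoustic model from three angles: training method, feature dimensionality reduction, and model parameter compression. The goal is to use as few parameters as possible while preserving model accuracy, thereby lowering the system's resource requirements. Building on existing methods, we propose and extend a series of new methods and demonstrate their effectiveness experimentally. The work concentrates on the following aspects.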
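     First, we study the minimum classification error (MCE) method and implement a training procedure based on N-best decoding. Experiments confirm that, at the same model accuracy, a model trained with MCE can be made significantly smaller. We also extend MCE training to the subspace distribution clustering HMM (SDCHMM), which largely compensates for the performance loss caused by feature-space splitting and subspace distribution clustering when a CDHMM is converted into an SDCHMM. Compared with an SDCHMM converted directly from a CDHMM, performance improves by 15-80%.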
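     Second, feature dimensionality reduction usually also degrades recognition performance. To address this, we propose adjusting the model parameters and the dimensionality-reduction transform together under the MCE criterion within a discriminative feature extraction framework, with a very marked effect. Furthermore, we propose a new feature extraction method in which a single LDA transform performs both decorrelation and dimensionality reduction, and we bring this method into the discriminative feature extraction framework as well. With this method, 14-dimensional features achieve the same performance as 39-dimensional MFCCs, significantly reducing computation and storage requirements.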
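     Third, because the individual states of an acoustic model contribute differently to system performance, we propose methods, implemented with a greedy algorithm, for determining the number of Gaussian components in each HMM state based on likelihood, Kullback-Leibler divergence (KLD), and inter-state divergence. For a given total number of Gaussian components, these methods respectively maximize the likelihood of the training data, minimize the distance between the current model and the "true" model, and maximize the divergence between the model's states. The method based on inter-state divergence incorporates competing information between states and is therefore discriminative in nature. Experimental results show that these methods outperform the method based on the Bayesian information criterion (BIC). At the same model accuracy, each of them reduces the number of parameters to some degree.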
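     Finally, we study feature-level parameter clustering of the acoustic model. For this clustering we propose LBG clustering with the KLD, which has an information-entropy interpretation, as the distance measure; the clustering performs well. Because different feature dimensions carry different amounts of discriminative information, we further propose KLD-based and likelihood-based non-uniform allocation of Gaussian kernels across the scalar dimensions. With the total number of Gaussian kernels held fixed, a greedy algorithm optimizes the allocation of kernels among dimensions, either to minimize the KLD between the compressed and original models or to maximize the likelihood of the training data. Both non-uniform allocation methods outperform uniform allocation, and the likelihood-based method in turn outperforms the KLD-based one. These methods compress the model parameters to about 15% of the original with essentially no loss in model performance. At that point the additions and subtractions required are about 50% of the original, while the multiplications and divisions drop sharply to under 1%. For the isolated-word task, the multiplications and divisions fall further to about 0.05% of those of the uncompressed model.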
Current state-of-the-art large-vocabulary speech recognition systems based on continuous density HMMs (CDHMMs) deliver fairly good recognition performance in benign environments, but usually at the price of large memory and high computational complexity. In this thesis we explore ways to obtain a parsimonious acoustic model while maintaining the same performance as the complex model. Three areas are explored: 1) training algorithms; 2) dimensionality reduction; 3) model compression. Novel and efficient algorithms are proposed.
    In model training, N-best based minimum classification error (MCE) training is developed. Experimental results show that a high-performance, parsimonious model can be obtained. MCE training is then extended to optimize the subspace distribution clustering HMM (SDCHMM). Experimental results show that the performance degradation resulting from converting a CDHMM to an SDCHMM can be recovered, and a 15-80% word error rate reduction is obtained.
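
The N-best MCE criterion mentioned above can be illustrated with a minimal sketch. It follows the common textbook form of the MCE misclassification measure and sigmoid loss, not necessarily the exact formulation used in the thesis. The function name `mce_loss` and the parameters `eta`, `gamma`, and `theta` are assumptions introduced here for illustration, and the scores stand for log-likelihoods of the correct transcription and of the N-best competing hypotheses.

```python
import math

def mce_loss(correct_score, competitor_scores, eta=1.0, gamma=1.0, theta=0.0):
    """Smoothed 0/1 loss for one utterance (illustrative, assumes eta > 0).

    correct_score     -- log-likelihood of the correct transcription
    competitor_scores -- log-likelihoods of the N-best competing hypotheses
    """
    # Soft maximum over competitors: (1/eta) * log(mean(exp(eta * g_j))),
    # computed with the log-sum-exp trick for numerical stability.
    scaled = [eta * s for s in competitor_scores]
    m = max(scaled)
    lse = m + math.log(sum(math.exp(v - m) for v in scaled) / len(scaled))
    # Misclassification measure: positive when competitors outscore the truth.
    d = -correct_score + lse / eta
    # Sigmoid turns the measure into a differentiable approximation of the error count.
    return 1.0 / (1.0 + math.exp(-(gamma * d + theta)))
```

A loss below 0.5 means the correct hypothesis outscores the soft maximum of its competitors. In training, the gradient of this loss with respect to the model parameters would drive the update, which is omitted here.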
    In dimensionality reduction, we jointly optimize the feature reduction transformation and the model parameters under the MCE criterion. A new feature extraction method, which uses LDA to perform both feature decorrelation and dimensionality reduction, is proposed and developed into a discriminative feature extraction framework. A 14-dimensional feature vector gives almost the same performance as the 39-dimensional MFCC features.
    In model compression, we find that different states contribute non-uniformly to recognition. Likelihood, Kullback-Leibler divergence (KLD), and state divergence are used to allocate Gaussian components to HMM states. The state divergence-based approach takes the discrimination between states into account. A greedy search is proposed to optimize the Gaussian component allocation. Compared with determination based on the Bayesian information criterion, the proposed approaches show improved performance.
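
The greedy allocation of Gaussian components under a fixed total budget can be sketched as follows. This is a generic greedy scheme written under stated assumptions, not the thesis's implementation: `greedy_allocate` and `marginal_gain` are hypothetical names, and `marginal_gain(state, k)` is assumed to return the improvement in the chosen criterion (likelihood, KLD, or state divergence) when a state grows from `k` to `k + 1` components.

```python
import heapq

def greedy_allocate(marginal_gain, num_states, total):
    """Distribute `total` Gaussian components over `num_states` states.

    Every state starts with one component; each remaining component goes
    to the state whose criterion currently improves the most.
    """
    if total < num_states:
        raise ValueError("need at least one component per state")
    counts = [1] * num_states
    # Max-heap via negated gains; entries are (-gain, state).
    heap = [(-marginal_gain(s, 1), s) for s in range(num_states)]
    heapq.heapify(heap)
    for _ in range(total - num_states):
        _, s = heapq.heappop(heap)
        counts[s] += 1
        # Re-evaluate the gain of adding yet another component to this state.
        heapq.heappush(heap, (-marginal_gain(s, counts[s]), s))
    return counts

# Toy criterion (assumption): gains shrink as 1/k and differ by state.
weights = [4.0, 1.0]
allocation = greedy_allocate(lambda s, k: weights[s] / k, num_states=2, total=4)
```

In the toy run the first state has the larger gains, so it receives both spare components. With a real model each call to `marginal_gain` would involve re-estimating or scoring the state's mixture, which dominates the cost.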
    We also study feature-level model compression. Optimal clustering and non-uniform allocation of Gaussian kernels in the scalar feature dimensions are proposed. Symmetric KLD is adopted to cluster the Gaussian kernels, and KLD-based and likelihood-based non-uniform allocation methods are developed using a greedy search. Our non-uniform allocation gives better performance than uniform allocation, especially at larger compression ratios; likelihood-based allocation also outperforms the KLD-based one. With almost negligible degradation in recognition performance, the original HMMs can be compressed to 15% of their original size, requiring about 1% of the original multiplication/division operations. For the isolated-word recognition task tested, the multiplication/division operations can be further reduced to 0.05%.
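
For reference, the symmetric KLD between two scalar Gaussian kernels has a closed form, sketched below. The function names are assumptions introduced here; how the thesis uses this distance inside its clustering (for example, how centroids are updated) is not shown.

```python
import math

def kl_gauss(mu_p, var_p, mu_q, var_q):
    """KL(p || q) for univariate Gaussians p = N(mu_p, var_p), q = N(mu_q, var_q)."""
    return 0.5 * (math.log(var_q / var_p)
                  + (var_p + (mu_p - mu_q) ** 2) / var_q
                  - 1.0)

def symmetric_kl(mu_p, var_p, mu_q, var_q):
    """Symmetric divergence: KL(p || q) + KL(q || p)."""
    return kl_gauss(mu_p, var_p, mu_q, var_q) + kl_gauss(mu_q, var_q, mu_p, var_p)
```

Unlike a Euclidean distance on the means, this measure also accounts for the variances, so two kernels with the same mean but very different spreads are kept apart.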

