Research on Confidence Feature Extraction Algorithms for Speech Recognition
Abstract
Large-vocabulary continuous speech recognition has been studied for more than two decades. Although significant progress has been made, the technology is still far from widespread application. In the course of overcoming the inherent shortcomings of recognition algorithms and pursuing better recognition performance, researchers gradually introduced the concept of the confidence measure, which quantifies how far the decisions made by a speech recognition system can be trusted. In recent years, speech recognition confidence measures have played an important role in applications such as speech error detection and correction, unsupervised and semi-supervised training, multi-pass search, and the screening of erroneous material in corpora.
     Traditional confidence annotation for speech recognition makes a classification decision based on individual confidence features or feature combinations, and the features in common use are derived mainly from decoding information. However, on the one hand, existing confidence features still mine the decoding information only in an isolated and static way, ignoring the relationship between a word and its surrounding environment; on the other hand, acoustic features still dominate, even though human listening experiments show that roughly 30% of the information people rely on in speech understanding comes from the guidance of syntactic, semantic, and other knowledge. How to uncover the relationship between a word and its environment, and at the same time extract the word's syntactic and semantic characteristics so as to improve the performance of recognition post-processing, is therefore a problem well worth studying in confidence feature extraction.
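As a rough illustration of the traditional setup described above, the following minimal sketch treats word-level confidence annotation as a binary classification over decoder-derived features. The feature set and the logistic-regression classifier are illustrative assumptions, not the baseline system built in the thesis.

# Minimal sketch of conventional word-level confidence annotation: each recognized
# word is represented by a vector of decoder-derived confidence features, and a
# binary classifier decides whether the word was recognized correctly.
# The feature set and the logistic-regression classifier are illustrative
# assumptions, not the baseline system built in the thesis.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy per-word feature vectors:
# [word posterior, acoustic score, language-model score, duration in seconds]
X = np.array([
    [0.95, -210.0, -3.1, 0.42],
    [0.31, -540.0, -7.8, 0.12],
    [0.88, -250.0, -2.9, 0.35],
    [0.12, -610.0, -9.2, 0.08],
])
y = np.array([1, 0, 1, 0])  # 1 = correctly recognized, 0 = misrecognized

clf = LogisticRegression().fit(X, y)   # train the confidence classifier
print(clf.predict_proba(X)[:, 1])      # per-word confidence scores in [0, 1]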
     With this aim, on top of a conventional confidence annotation system for speech recognition, this thesis proposes two new kinds of confidence features. The first is a set of environment features, divided into contextual-environment, dynamic-environment, and sentence-global-environment features; by reprocessing the decoding information, they give a fairly complete description of the relationship between a word and its environment from both spatial and temporal perspectives. The second is TSS (Topic Similarity based Semantic confidence feature extraction algorithm), a semantic-level confidence feature extraction algorithm based on topic similarity: the topic model LDA (Latent Dirichlet Allocation) is used to compute the topic distribution of each word in the recognition result and the topic distribution of its context, and the topic similarity between the two serves as the word's semantic confidence feature. Experiments show that the two proposed features both mine the decoding-level information more deeply and broaden the information sources of confidence features; combined with decoding-level confidence features, they effectively improve the accuracy of confidence annotation.
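The abstract does not spell out the exact form of the environment features, so the sketch below is only one plausible reading, assuming each recognized word carries a posterior probability from the decoder: contextual features from the neighboring words, dynamic features as differences over time, and sentence-global features as utterance-level statistics.

# One plausible (assumed) reading of the three environment feature groups, given
# only per-word posterior probabilities from the decoder:
#   contextual      - posteriors of the neighboring words (spatial relationship)
#   dynamic         - change of the posterior relative to the neighbors (temporal)
#   sentence-global - utterance-level statistics shared by every word
# This is an illustrative sketch, not the feature definitions used in the thesis.
from statistics import mean

def environment_features(posteriors):
    """posteriors: list of per-word posterior probabilities for one utterance."""
    sent_mean = mean(posteriors)
    sent_min = min(posteriors)
    feats = []
    for i, p in enumerate(posteriors):
        left = posteriors[i - 1] if i > 0 else sent_mean
        right = posteriors[i + 1] if i < len(posteriors) - 1 else sent_mean
        feats.append({
            "context_left": left,       # contextual environment
            "context_right": right,
            "delta_left": p - left,     # dynamic environment
            "delta_right": p - right,
            "sent_mean": sent_mean,     # sentence-global environment
            "sent_min": sent_min,
        })
    return feats

print(environment_features([0.95, 0.31, 0.88, 0.12]))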
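The TSS idea itself can be sketched more directly: infer a topic distribution for the recognized word and another for its surrounding context under an LDA model, and take the similarity between the two distributions as the semantic confidence feature. The gensim-based training and the cosine similarity below are stand-ins; the thesis does not specify its LDA implementation or similarity measure.

# Sketch of the TSS idea: the similarity between a word's topic distribution and the
# topic distribution of its context is used as a semantic confidence feature.
# gensim's LdaModel and cosine similarity are stand-ins here; the thesis does not
# specify which LDA implementation or similarity measure it uses.
import numpy as np
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Toy training corpus (in practice, a large text corpus from the task domain).
docs = [
    ["stock", "market", "price", "trade"],
    ["football", "match", "goal", "team"],
    ["market", "trade", "economy", "price"],
    ["team", "coach", "goal", "season"],
]
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]
lda = LdaModel(corpus, num_topics=2, id2word=dictionary, passes=20, random_state=0)

def topic_vector(tokens):
    """Dense topic distribution inferred for a bag of tokens."""
    vec = np.zeros(lda.num_topics)
    for topic_id, prob in lda.get_document_topics(dictionary.doc2bow(tokens),
                                                  minimum_probability=0.0):
        vec[topic_id] = prob
    return vec

def tss_feature(word, context):
    """Cosine similarity between the word's and its context's topic distributions."""
    w, c = topic_vector([word]), topic_vector(context)
    return float(np.dot(w, c) / (np.linalg.norm(w) * np.linalg.norm(c)))

# In a well-trained model, a word that fits its context receives a higher semantic
# confidence than a word that does not.
print(tss_feature("price", ["stock", "market", "trade"]))
print(tss_feature("goal", ["stock", "market", "trade"]))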
