Research on Speech Recognition Techniques Based on Confusion Networks and Side Information
Abstract
Communicating freely with machines through speech has been a long-held human dream. After decades of sustained effort, speech recognition technology has made great progress, yet it still falls short of the needs of practical applications. Further improving recognition performance and robustness has become the bottleneck in the development of speech recognition technology.
     Humans implicitly exploit many information sources when recognizing speech, whereas current computer-based recognition systems typically use only very limited acoustic and linguistic information, such as spectral features and N-gram statistical language models. For a task as complex as speech recognition, this primary information is far from sufficient, and effectively modeling and applying other side information can help improve recognition performance. A confusion network is a compact representation of multiple candidate recognition results, and decoding over a confusion network can minimize the word error rate. Integrating side information into confusion-network decoding is therefore an effective way to improve recognition performance.
     This thesis studies methods for improving speech recognition performance from two angles: confusion networks and side information. On the confusion-network side, it investigates efficient construction methods and decoding methods that integrate side information. On the side-information side, it investigates effective ways to model and apply several important information sources. The main contributions and innovations are as follows:
     1. Two fast methods for constructing high-quality confusion networks are proposed. The first reduces the computational scale of confusion-network construction by segmenting the lattice, which speeds up generation with only a slight loss of quality. The second uses the arc with the maximum posterior probability to guide the construction of each confusion set, reducing the algorithm's complexity to linear time. To improve the quality of the resulting networks, an arc-similarity measure based on K-L divergence is proposed. Finally, two new confusion-network structures are introduced for Chinese speech recognition: the character confusion network and the logical confusion network.
     2. Modeling methods for two kinds of side information, together with decoding methods that apply them on confusion networks, are proposed. To exploit long-distance dependencies between words, a confusion-network decoding method based on a semantic-class-pair trigger language model is proposed. To exploit additional knowledge sources, a confusion-network decoding method that fuses the results of multiple systems is proposed. Experiments show that the two methods reduce the character error rate by 7.9% and 10.7% relative, respectively.
     3. Methods for using tone information to improve Mandarin speech recognition are proposed. In the acoustic decoding stage, a hidden Markov model based on multi-space probability distributions is adopted to model tone, solving the problem that the tone feature is discontinuous. Decoding spectral and pitch features synchronously within a two-stream HMM framework reduces the character error rate by 15.9% relative. In the second decoding pass, an explicit tone-modeling method based on supra-tone units is proposed; decoding the confusion network with supra-tone models yields a further 8.0% relative reduction in character error rate.
     4. A Mandarin speech-input system with fast online correction of input errors is developed. Using the character confusion network, sentence-level candidates are decomposed into character-level candidates, allowing the user to correct nearly half of the recognition errors quickly from the candidate lists. To enter new characters quickly and reliably, an isolated-character speech-input method assisted by handwriting information is proposed; it is faster than handwriting input alone and more reliable than speech input alone.
     In summary, this thesis improves the performance and practicality of speech recognition through the study of confusion networks and side information. The efficient confusion-network generation methods should also benefit other tasks, such as spoken-document retrieval. The confusion-network decoding methods based on the trigger language model and multi-system combination provide a useful reference for exploiting other types of side information. The work on tone information is a promising starting point for fully exploiting acoustic side information such as stress and intonation. Finally, the use of confusion networks and handwriting information makes the correction of speech-input errors faster and more reliable, a successful application of side information and confusion networks to speech recognition.
Communicating freely with computers via speech has been a human dream for many years. Although great progress has been achieved in speech recognition after decades of unremitting effort, the technology still falls short of practical applications. Further improving performance and robustness has become the bottleneck of speech recognition.
     It is well known that humans implicitly exploit a large number of information sources in speech perception, whereas an automatic speech recognition system uses only very limited acoustic and linguistic knowledge, i.e., the spectral features of the speech signal and an N-gram statistical language model. This information is far from sufficient for a task as complicated as speech recognition.
     The performance of speech recognition can therefore be improved by effectively modeling and applying other side information. A confusion network is a compact representation of multiple candidates, and the word error rate can be minimized by performing second-pass decoding on it. Using the confusion network as a decoding platform into which various side information can be integrated is a promising way to improve recognition performance.
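The consensus-decoding idea above can be sketched in a few lines: each slot of the confusion network holds word candidates with posterior probabilities, and the consensus hypothesis takes the top candidate per slot. The toy network below is invented for illustration and is not data from the thesis.

```python
def consensus_decode(confusion_network):
    """Pick the highest-posterior word in every confusion set.

    confusion_network: list of dicts mapping word -> posterior probability.
    Returns the consensus word sequence, skipping the epsilon (deletion)
    arc when it wins a slot.
    """
    hypothesis = []
    for slot in confusion_network:
        best_word = max(slot, key=slot.get)
        if best_word != "<eps>":  # epsilon arc means "no word here"
            hypothesis.append(best_word)
    return hypothesis

# Toy network: three confusion sets with normalized posteriors.
cn = [
    {"I": 0.8, "eye": 0.2},
    {"<eps>": 0.6, "a": 0.4},      # deletion wins this slot
    {"scream": 0.3, "see": 0.7},
]
print(consensus_decode(cn))  # ['I', 'see']
```

Because each slot is decided independently by posterior mass, this procedure minimizes the expected word error rate rather than the sentence error rate, which is exactly why confusion-network decoding can beat picking the single best path.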
     Accordingly, two subjects are studied in this thesis: confusion networks and side information. The aim is to reduce the character error rate by performing confusion-network decoding with various side information. On the confusion-network side, efficient approaches to generating and decoding confusion networks are studied; on the side-information side, effective methods for modeling and applying it are investigated. The major original contributions are as follows:
     1. Two approaches to efficiently generating confusion networks are proposed. In the first, the lattice scale is reduced by segmenting the original lattice into multiple sub-lattices, which improves generation speed at the cost of a slight decline in quality. In the second, the construction of each confusion set is guided by the arc with the maximum posterior probability, which reduces the complexity of the generation algorithm to linear time. Moreover, K-L divergence is introduced to measure the similarity between two arcs, which improves the quality of the confusion network. Finally, two new confusion-network structures are introduced for the Chinese speech recognition task: the character-based confusion network and the logical confusion network.
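A rough sketch of the max-posterior-guided construction described above: the unassigned arc with the highest posterior seeds each confusion set, and the remaining arcs join the seed they overlap with in time. The arc representation, the overlap threshold, and the toy lattice are all invented for illustration; the thesis additionally uses a K-L divergence similarity rather than plain temporal overlap.

```python
def overlap(a, b):
    """Temporal overlap ratio between two arcs (start, end, word, posterior)."""
    inter = min(a[1], b[1]) - max(a[0], b[0])
    return max(inter, 0) / min(a[1] - a[0], b[1] - b[0])

def build_confusion_sets(arcs, min_overlap=0.5):
    remaining = sorted(arcs, key=lambda a: -a[3])   # posterior, descending
    sets = []
    while remaining:
        seed = remaining.pop(0)                     # max-posterior arc guides the set
        members, rest = [seed], []
        for arc in remaining:
            (members if overlap(seed, arc) >= min_overlap else rest).append(arc)
        remaining = rest
        sets.append(members)
    return sorted(sets, key=lambda s: s[0][0])      # order sets by seed start time

# Toy lattice arcs: (start_time, end_time, word, posterior).
arcs = [(0.0, 0.4, "he", 0.6), (0.0, 0.5, "she", 0.3),
        (0.5, 1.0, "saw", 0.7), (0.45, 1.0, "thaw", 0.2)]
print([[a[2] for a in s] for s in build_confusion_sets(arcs)])
# [['he', 'she'], ['saw', 'thaw']]
```

Each arc is examined a bounded number of times against the current seed, which is the intuition behind the linear-time claim; a full implementation must also preserve the lattice's topological order between sets.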
     2. Decoding methods that integrate two types of side information on confusion networks are studied. A trigger language model based on semantic class pairs is proposed to model long-span dependencies between words, and it is integrated into the confusion-network decoding process. Because different speech recognition systems use different knowledge sources and modeling methods, their error patterns also differ; a decoding method is therefore proposed that combines the results of multiple recognition systems on a confusion network. Experimental results show that the two methods reduce the character error rate by 7.9% and 10.7% relative, respectively.
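The multi-system combination can be illustrated with a voting sketch in the spirit of confusion-network combination. Real systems must first align the competing hypotheses; here the toy outputs are assumed to be already aligned character by character, which is an invented simplification, and the systems are weighted equally.

```python
from collections import Counter

def vote(aligned_hypotheses):
    """Majority vote per position across equally weighted systems."""
    combined = []
    for candidates in zip(*aligned_hypotheses):
        combined.append(Counter(candidates).most_common(1)[0][0])
    return combined

# Three hypothetical recognizers disagree at two positions;
# voting recovers the correct characters at both.
sys_a = ["今", "天", "天", "气", "好"]
sys_b = ["今", "天", "天", "七", "好"]
sys_c = ["金", "天", "天", "气", "好"]
print(vote([sys_a, sys_b, sys_c]))  # ['今', '天', '天', '气', '好']
```

The gain comes precisely from the differing error patterns mentioned above: as long as the systems' errors are not correlated, the correct character tends to hold the plurality at each position.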
     3. The use of tone information to improve Chinese speech recognition is investigated. In the acoustic decoding stage, a multi-space probability distribution HMM (MSD-HMM) is adopted to model tone, resolving the problem that the tone feature is discontinuous across the utterance. Within a two-stream HMM framework, spectral and pitch features are decoded synchronously. In the second pass, tone information over a longer time span is used to build explicit tone models, which are applied to decoding the confusion network generated in the first pass. Experimental results show a 15.9% relative character error reduction from the first-pass decoding and an additional 8.0% relative reduction from the second-pass decoding.
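The multi-space distribution idea that handles the discontinuous F0 stream can be sketched as follows: a state splits its probability mass between a zero-dimensional "unvoiced" space and a Gaussian over log-F0 for voiced frames. The weights, mean, and variance below are invented for the example, not parameters from the thesis.

```python
import math

def msd_likelihood(obs, w_voiced, mean, var):
    """obs is None for an unvoiced frame, else a log-F0 value."""
    if obs is None:                      # discrete unvoiced space
        return 1.0 - w_voiced
    gauss = math.exp(-(obs - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
    return w_voiced * gauss              # continuous voiced space

# Under a mostly-voiced state (w_voiced = 0.9), a voiced frame near the
# state's mean scores higher than an unvoiced frame, so voicing itself
# becomes evidence during decoding.
print(msd_likelihood(5.0, 0.9, 5.0, 0.04) > msd_likelihood(None, 0.9, 5.0, 0.04))
```

Because every frame, voiced or not, receives a well-defined likelihood, the pitch stream can be decoded synchronously with the spectral stream without interpolating fake F0 values through unvoiced regions.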
     4. A reliable speech-input system with fast correction of input errors is developed. The character-based confusion network is used to decompose sentence-level hypotheses into character-level ones, allowing the user to correct about half of the recognition errors quickly and conveniently. To speed up the input of new characters, a speech recognition method assisted by handwriting information is proposed; it has a faster input rate than handwriting alone and is more reliable than speech recognition alone.
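A hypothetical sketch of the correction interface built on a character-based confusion network: each sentence position exposes a short ranked candidate list, so the user replaces a wrong character by picking from the list instead of retyping. The candidates and posteriors below are invented toy data.

```python
def candidate_lists(char_confusion_network, top_n=3):
    """Per-position character candidates, ranked by posterior, best first."""
    return [sorted(slot, key=slot.get, reverse=True)[:top_n]
            for slot in char_confusion_network]

# Two-character utterance; each slot is a dict of character -> posterior.
cn = [{"知": 0.7, "之": 0.2, "支": 0.1},
      {"道": 0.6, "到": 0.4}]
lists = candidate_lists(cn)
first_pass = "".join(slot[0] for slot in lists)
print(first_pass)   # 知道  (top-1 output shown to the user)
print(lists[1])     # ['道', '到'] — user picks '到' if '道' was wrong
```

Whenever the correct character already sits somewhere in a slot's candidate list, the error is fixed with one selection, which is how roughly half of the recognition errors become correctable without re-entry.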
     To sum up, this thesis investigates confusion-network generation, decoding methods that integrate side information, and the modeling and application of side information, and achieves performance improvements in speech recognition. Efficiently constructing high-quality confusion networks is the basis of decoding and is significant not only for speech recognition but also for other tasks built on confusion networks, such as spoken-document retrieval. The study of decoding methods that integrate a semantic-class-pair trigger language model and multi-system combination provides a useful reference for exploiting other types of side information. The application of tone information markedly improves recognition performance and is a good starting point for exploiting other acoustic side information, such as stress and intonation. Finally, confusion networks and handwriting information make the speech-input system more reliable and its error correction more convenient and efficient, a successful application of side information and confusion networks to speech recognition.
