Research on Confidence Measures in Speech Recognition
Abstract
Voice short-message input on mobile phones frees users from the inconvenience of typing messages by hand. It meets a real application demand but has not yet been solved well, so SMS speech recognition has become a hot topic in current speech recognition research. SMS speech consists of short, highly colloquial utterances, which makes it difficult to recognize. The main problems to be solved are: construction of a mobile-phone speech database, development of the recognition system, and confidence evaluation of recognition results.
     This thesis studies the SMS speech recognition problem, builds a high-quality SMS text corpus and speech database, and sets up a confidence evaluation system. In addition, to address the imbalance between classification samples, it carries out a preliminary study of classification on imbalanced data sets.
     The research focuses on the construction of the SMS speech corpus and on feature extraction and feature selection for confidence classification. The main work is as follows:
     1. Construction of a high-quality SMS speech corpus
     Good speech and text corpora are of great help for training acoustic and language models and are indispensable for system testing. This thesis implements an SMS phonetic-transcription system and, according to the characteristics of SMS text, chooses a suitable corpus selection algorithm that automatically selects 6,000 phonetically rich SMS sentences from 500,000 raw messages. Under the precondition that all rare triphones are selected, the triphone distribution is kept as balanced as possible. The theoretical triphone coverage of the 6,000 sentences reaches 93.9%, and the actual coverage reaches 100%. Based on this corpus, a mobile-phone speech database of more than 300 hours, recorded by 200 speakers, was built.
     2. Feature extraction and feature selection for confidence classification
     In voice SMS input, the reliability of recognition results is a practical problem that must be solved. Traditional confidence measures for speech recognition make classification decisions from various static features and ignore the information carried by a word's relation to its surrounding context. On a baseline system with a word error rate of 14.02%, classification with 10-dimensional static features reduces the classification error rate by 24.9% relative to the baseline. On top of the static features, context features and dynamic features are proposed; combined with the static features, they improve classification by a further 7.4%. However, not all features contribute positively: too many features introduce redundant information and slow down classification. To address this, feature extraction and feature selection are introduced into confidence research: feature extraction reduces the feature dimensionality, and feature selection chooses an effective subset of the original features. Experiments show that the proposed context and dynamic features are comparatively important for classification and that the feature set can be effectively compressed through feature extraction and feature selection.
     3. Classification of imbalanced data sets
     The experimental data for confidence classification are features produced during speech recognition. Because the recognition rate is relatively high, the ratio of correctly to incorrectly recognized samples approaches 8:1. To address this imbalance between correct and error samples in training the confidence classifier, the author carries out a preliminary study of imbalanced data set classification and proposes an improved under-sampling method that markedly raises the classifier's accuracy on the error class while only slightly lowering its accuracy on the correct class.
Voice short-message input on mobile phones offers users great convenience. It has practical applications but has not yet been well developed, so voice short-message recognition has become a hot issue in speech recognition. Because short messages are brief and colloquial, recognizing them is difficult. The main problems of voice short-message recognition are the construction of a mobile-phone speech database, recognition system development, and confidence measures for recognition results.
     This thesis studies voice short-message recognition. We constructed a good speech database and text corpus and built a confidence measure evaluation system. In addition, we carried out a preliminary study of imbalanced data set classification.
     The research focuses on feature extraction and feature selection for confidence measure classification, and on imbalanced data set classification. The main contributions are described as follows:
     The establishment of a good speech corpus is of great help in training acoustic and language models. This thesis analyzes three corpus selection algorithms and, according to the characteristics of SMS text, develops a suitable one. 6,000 phonetically rich SMS messages were chosen from 500,000 raw messages. Under the precondition that all rare triphones are selected from the raw material, we balanced the triphone distribution as far as possible. The theoretical triphone coverage of the 6,000 messages reaches 93.9%, and the actual coverage reaches 100%. Based on the chosen corpus, we built an SMS speech database of more than 300 hours, recorded by 200 speakers.
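The abstract does not give the exact selection algorithm, so the following is only a minimal illustrative sketch of the greedy idea described above: repeatedly pick the sentence that adds the most uncovered triphones, so rare triphones enter the set early. For simplicity, a "triphone" is approximated here by a sliding three-character window; a real system would derive phone sequences from a pronunciation lexicon.

```python
def select_corpus(sentences, target_size):
    """Greedily select up to target_size sentences that maximize
    incremental triphone coverage of the pool."""
    def triphones(s):
        # Crude stand-in for real triphones: 3-character windows.
        return {s[i:i + 3] for i in range(len(s) - 2)}

    covered = set()
    selected = []
    pool = list(sentences)
    while pool and len(selected) < target_size:
        # Pick the sentence contributing the most new triphones.
        best = max(pool, key=lambda s: len(triphones(s) - covered))
        if not triphones(best) - covered:
            break  # nothing left adds coverage
        covered |= triphones(best)
        selected.append(best)
        pool.remove(best)
    return selected, covered
```

A per-triphone frequency cap could be added to the scoring step to keep the selected set balanced, as the thesis requires, rather than merely maximizing coverage.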
     Traditional speech recognition confidence methods use the static features of a word to judge whether it is correctly recognized, neglecting the information carried by its context and surrounding environment. In this thesis, a speech recognizer with a 14.1% word error rate (WER) serves as the baseline system, and 10-dimensional static features achieve a 24.9% reduction in classification error rate (CER). Context features and dynamic features are then extracted in relation to the static features, and the combined 42-dimensional feature set improves CER by a further 7.4% over the static features alone. However, not all of these features have a positive impact on classification: too many features not only carry redundant information but also make classification time-consuming. To solve this problem, feature extraction, which distills the principal information from the original features, and feature selection, which chooses an effective subset of them, are proposed in this thesis. The experimental results show that context features and dynamic features are effective for classification, and that the feature set can be considerably compressed through feature extraction and feature selection.
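The abstract does not name the selection criterion used to choose the effective feature subset, so the following is only an illustrative sketch of one common possibility: ranking each feature dimension by its Fisher score (between-class separation over within-class spread) and keeping the top k.

```python
import statistics

def fisher_score(pos, neg):
    """Fisher criterion for one feature: squared mean difference
    divided by the sum of class variances. Higher = more discriminative."""
    mp, mn = statistics.mean(pos), statistics.mean(neg)
    vp, vn = statistics.pvariance(pos), statistics.pvariance(neg)
    return (mp - mn) ** 2 / (vp + vn + 1e-12)

def select_features(X_pos, X_neg, k):
    """Rank feature columns by Fisher score; return the top-k indices.
    X_pos/X_neg are lists of feature vectors for the two classes."""
    dims = len(X_pos[0])
    scores = [
        fisher_score([row[d] for row in X_pos],
                     [row[d] for row in X_neg])
        for d in range(dims)
    ]
    return sorted(range(dims), key=lambda d: scores[d], reverse=True)[:k]
```

A filter criterion like this is cheap and classifier-independent; wrapper methods that score subsets with the actual classifier are costlier but can capture feature interactions.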
     The experimental data for confidence measure classification come from the speech recognition process. Because the recognition rate is relatively high, the ratio of correct to incorrect samples reaches 8:1. To deal with this problem, imbalanced data set classification is studied: the author carried out a preliminary study and used an under-sampling method. The classification rate on the wrong class improved greatly, while the classification rate on the correct class decreased only slightly.
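The thesis proposes an improved under-sampling scheme whose details are not given in the abstract; as a baseline for comparison, plain random under-sampling of the majority (correct) class down to a chosen ratio against the minority (error) class can be sketched as follows:

```python
import random

def undersample(majority, minority, ratio=1.0, seed=0):
    """Randomly under-sample the majority class so that
    len(majority') <= ratio * len(minority). With an 8:1 skew,
    this keeps a classifier from ignoring the minority class."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    target = int(ratio * len(minority))
    if target >= len(majority):
        return list(majority), list(minority)
    return rng.sample(majority, target), list(minority)
```

Random under-sampling discards potentially informative majority examples, which is presumably what the thesis's improved method mitigates, e.g. by removing redundant majority samples preferentially.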
