Research and Implementation of an ASR-Based Children's Language Education System

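English title: Research and Implementation of Children Speech-Training System Based on ASR
Author: Xu Kaiwei (许开维)
Thesis level: Master's
Discipline: Computer Application Technology
Keywords (translated from Chinese): ASR; Speech API; COM; children's language education; adaptive techniques
English keywords: ASR; Speech API; COM; children speech-training; adaptive
Degree year: 2006
Supervisor: Shen Jun (沈军)
Discipline code: 081203
Degree-granting institution: Southeast University (东南大学)
Submission date: 2006-06-16
Defense committee chair: Chen Guanqing (陈冠清)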

Abstract

As computer technology has spread and matured, computers have become part of everyday life. Speech is the most direct and convenient way for people to communicate with a computer, so speech recognition and speech synthesis have become a hallmark of modern technology and an important area of computer research and development. Speech recognition draws on many disciplines, and advances in those fields have in turn driven its progress. Although the technology has achieved some success, most speech recognition systems are still confined to laboratory trials and fall far short of practical use.
     This thesis studies two widely used speaker adaptation methods: maximum a posteriori (MAP) estimation and maximum likelihood linear regression (MLLR). Building on them, it proposes a composite incremental adaptation method suited to speech recognition that combines the strengths of both. The method simplifies the MLLR module to a single global transformation matrix (one global regression class), which reduces the mismatch caused by the environment and by speakers' physiological differences and gives the MAP module a more accurate initial model. An incremental MAP module then captures the finer, phoneme-level differences and preserves the asymptotic behavior of the overall method. The composite method was applied to improve the Microsoft speech recognition engine, and it performed well in the follow-up validation experiments. The experiments show that the method copes effectively with both speaker and environment variation and meets the needs of a speech recognition system.
     Building on these results, the thesis combines modern educational technology with the needs of children's language education and uses the improved Microsoft speech recognition engine to develop a children's language education program. The software implements Chinese speech recognition; communication among VC++, Flash, and the Microsoft speech recognition engine; recognition of Chinese, pinyin, and English speech; animations that indicate whether a pronunciation is correct; and text-to-speech (TTS). It is visual and intuitive, practical to use, and a reasonably successful tool for children's language education.
     By studying adaptation methods for speech recognition and applying the results to children's language education, the thesis achieves good results and has value both for research and for practical application.
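
The two-stage idea described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: it uses the textbook forms of the two updates, assumes the global MLLR transform (`A`, `b`) has already been estimated elsewhere, and all names (`apply_global_mllr`, `IncrementalMapMean`, the prior weight `tau`) are hypothetical. Stage one moves every Gaussian mean with one shared affine transform; stage two refines a single mean incrementally with the standard MAP formula, so the estimate drifts from the prior toward the speaker's data as more frames arrive.

```python
from dataclasses import dataclass, field
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def apply_global_mllr(means: List[Vector], A: Matrix, b: Vector) -> List[Vector]:
    """Apply one shared affine transform mu' = A * mu + b to every mean.

    A and b are assumed to be estimated beforehand (hypothetical inputs).
    """
    return [
        [sum(a * m for a, m in zip(row, mu)) + b_i for row, b_i in zip(A, b)]
        for mu in means
    ]


@dataclass
class IncrementalMapMean:
    """Incremental MAP estimate of one Gaussian mean.

    mean = (tau * prior_mean + sum_t gamma_t * x_t) / (tau + sum_t gamma_t)
    """
    prior_mean: Vector                 # e.g. a mean produced by the MLLR stage
    tau: float = 10.0                  # prior weight (hypothetical value)
    occupancy: float = 0.0             # accumulated sum of gamma_t
    weighted_sum: Vector = field(default_factory=list)

    def __post_init__(self) -> None:
        if not self.weighted_sum:
            self.weighted_sum = [0.0] * len(self.prior_mean)

    def update(self, frames: List[Vector], gammas: List[float]) -> Vector:
        """Fold one batch of frames (with occupation weights) into the statistics."""
        for x, g in zip(frames, gammas):
            self.occupancy += g
            self.weighted_sum = [s + g * xi for s, xi in zip(self.weighted_sum, x)]
        return self.mean()

    def mean(self) -> Vector:
        denom = self.tau + self.occupancy
        return [
            (self.tau * m + s) / denom
            for m, s in zip(self.prior_mean, self.weighted_sum)
        ]


# Usage: transform the speaker-independent mean first, then adapt it with MAP.
si_means = [[1.0, 2.0]]
adapted = apply_global_mllr(si_means, A=[[2.0, 0.0], [0.0, 1.0]], b=[0.5, -1.0])
estimator = IncrementalMapMean(prior_mean=adapted[0], tau=2.0)
refined = estimator.update(frames=[[4.5, 3.0], [4.5, 3.0]], gammas=[1.0, 1.0])
```

Because the accumulated statistics persist between calls to `update`, adaptation data can be supplied utterance by utterance, and the weight of the prior shrinks as the occupancy grows.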

