Design, Construction, and Preliminary Testing of a Multimedia Speech Database of English as a Second Language

English title: A Design Execution and Recognition Testing of Multimedia English Database of Second Language
Author: Su Yiling (苏意玲)
Degree level: Master's
Discipline: Computer Application Technology
Keywords: automatic speech recognition (ASR); recognition rate; man-machine interface; multimedia speech database; Mel-frequency cepstral coefficients (MFCC)
Degree year: 2007
Advisors: Li Jianshi (李坚石); Wei Yuanjun (韦元军)
Discipline code: 081203
Degree-granting institution: Guizhou University
Thesis submission date: 2007-05-01

Abstract

Automatic speech recognition (ASR) is an interdisciplinary field and is gradually becoming a key technology for man-machine interfaces in information technology. In recent years, applications of computer speech recognition have made considerable progress. Because of the special status of English, many speech databases of English as a first language have been designed and built around the world, and these databases have played a major role in the progress of ASR. As English spreads, however, more and more people use it as a second language, so it is necessary to build a speech database of English as a second language in order to serve these speakers well.
     Different countries have different languages, each with its own pronunciation characteristics, and these characteristics carry over into the English spoken as a second language. This thesis focuses on China: we design and build a speech database of English as a second language, and we run a series of tests on it in an HTK speech recognition system that we constructed.
     The work and contributions of this thesis are as follows:
     1. An HTK speech recognition system was built under Linux.
     2. The feature parameters were studied and improved. The recognition system uses Mel-frequency cepstral coefficients (MFCC), which reflect how humans perceive speech, as feature parameters, and it also takes the dynamic (transient) characteristics of the speech signal into account. Experiments show that adding these features to form a mixed feature set markedly improves the recognition rate of the system. The recognition rates of the various parameter configurations were compared, and the feature parameters giving the best recognition performance were identified.
     3. Hidden Markov models (HMMs) were used for model training. Different numbers of states were tested, and the best recognition performance was obtained with 10 states.
     4. The design and construction of the speech database and the model training process are described. Using the parameters chosen in the preceding experiments, data from a standard speech corpus (the existing AVICAR database) and the collected speech data were tested and compared. The recognition rate for English digits spoken by the Chinese speakers in the collected database was far lower than for the standard data, which demonstrates the importance of collecting speech databases from different regions. The causes of the low recognition rate are analyzed. The collected data were then compared across regions of China, and the reasons for the differences in recognition rate are summarized, providing experience for the design and construction of speech databases of English as a second language.
     5. The trained models were improved. Speech data from Chinese speakers were selected from TIDIGIT and added once, twice, and three times to the AVICAR data to train models, and the resulting models were then tested and compared. The recognition rate improved, which indicates that models tailored to a specific region can substantially improve recognition performance.
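
     Item 2 above combines static MFCCs with dynamic features of the speech signal. The abstract does not give the formula or the settings that were used, so the following is only a minimal sketch of one common way to do this: the regression-based delta coefficients described in the HTK Book, appended to each static frame together with their second-order (acceleration) counterparts. The function names, the window of 2 frames, and the edge handling (repeating the first and last frame) are assumptions made for illustration, not details taken from the thesis.

```python
from typing import List

Frame = List[float]  # one feature vector per analysis frame


def delta(frames: List[Frame], window: int = 2) -> List[Frame]:
    """Regression-based delta coefficients over +/- `window` frames.

    d_t = sum_k k * (c_{t+k} - c_{t-k}) / (2 * sum_k k^2), k = 1..window.
    Frames beyond either end are replaced by the first/last frame
    (an assumed edge policy).
    """
    if window < 1:
        raise ValueError("window must be >= 1")
    n = len(frames)
    denom = 2.0 * sum(k * k for k in range(1, window + 1))
    out: List[Frame] = []
    for t in range(n):
        row: Frame = []
        for d in range(len(frames[t])):
            acc = 0.0
            for k in range(1, window + 1):
                ahead = frames[min(t + k, n - 1)][d]
                behind = frames[max(t - k, 0)][d]
                acc += k * (ahead - behind)
            row.append(acc / denom)
        out.append(row)
    return out


def add_dynamic_features(static: List[Frame], window: int = 2) -> List[Frame]:
    """Append delta and delta-delta coefficients to each static frame."""
    d1 = delta(static, window)   # first-order dynamics
    d2 = delta(d1, window)       # second-order dynamics (acceleration)
    return [s + a + b for s, a, b in zip(static, d1, d2)]
```

     With a 12-dimensional static MFCC frame, `add_dynamic_features` would return 36-dimensional frames. In HTK itself the same effect is normally requested through target-kind qualifiers such as `MFCC_D_A` rather than computed by hand; which configuration the thesis actually used is not stated in the abstract.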

