An Automatic Labeling System for Broadcast Speech
Abstract
In recent years, with the rapid development of computer, network, and communication technology, multimedia files such as images, video, and audio have become easy to obtain, and the volume of multimedia data worldwide has grown explosively; audio occupies an important place among these data. Effectively indexing and retrieving such massive multimedia resources has become an active research topic, and content-based audio retrieval places higher demands on today's large-vocabulary speech recognition systems.
     Among the many kinds of multimedia data, broadcast news is the representative audio source studied by most multimedia research projects, because broadcast speech contains silence, music, speaker speech, noisy backgrounds, and other audio elements. Improving the accuracy and robustness of broadcast speech recognition requires a large, precisely labeled corpus. Annotating a large speech corpus is well known to demand substantial human and material resources, and because the accuracy of broadcast speech recognition is still limited, this labeling currently has to be done manually. Automating the text annotation of speech audio is therefore an important way to reduce the cost of building speech recognition systems.
     Against this background, this thesis builds an automatic labeling system for broadcast speech. In most cases the audio files of broadcast speech and their corresponding transcripts can be found on the Internet, so the core problem studied here is not pure recognition but how to align a given, known transcript with the audio.
     This thesis proposes a recursive alignment algorithm based on speech recognition and dynamic-programming anchor finding, where an anchor is a trusted alignment region. The algorithm can be summarized as follows: first run speech recognition on the continuous audio to obtain a hypothesis text; then match the hypothesis against the known transcript, and take the regions where the two texts agree as trusted alignment regions, called anchors; use the anchors to split both the audio and the known transcript into aligned and unaligned parts; and repeat the procedure recursively on the unaligned parts. Motivated by practical considerations, namely the purpose of labeling the corpus, possible errors in the known transcript, and the very poor quality of some audio, this thesis proposes three improvements. First, alignment is performed at the sentence level, which makes subsequent manual revision easier. Second, anchors are found with the dynamic-programming algorithm DTW, whose error tolerance reduces the impact of erroneous transcript text on the whole labeling system. Third, for audio whose quality is so poor that no anchor can be found, acoustic model adaptation is used to improve recognition accuracy and complete the alignment.
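The recursive anchor-and-partition procedure can be sketched as follows. This is a minimal illustration under assumed interfaces, not the thesis implementation: `recognize` and `find_anchor` are hypothetical stand-ins for the ASR pass and the anchor search, and the toy stubs below simulate a perfect recognizer that hears one word per second.

```python
def recursive_align(words, lo, hi, t0, t1, recognize, find_anchor, out):
    """Align transcript words[lo:hi] against the audio span [t0, t1).

    Appends trusted (word_start, word_end, time_start, time_end) anchors
    to `out`, then recurses on the unaligned parts on either side.
    """
    if hi <= lo or t1 <= t0:
        return
    hyp = recognize(t0, t1)                   # hypothesis for this span
    anchor = find_anchor(hyp, words, lo, hi)  # trusted region, or None
    if anchor is None:
        return  # here the full system would fall back to model adaptation
    i, j, s, e = anchor                       # words[i:j] trusted at [s, e)
    out.append((i, j, s, e))
    recursive_align(words, lo, i, t0, s, recognize, find_anchor, out)
    recursive_align(words, j, hi, e, t1, recognize, find_anchor, out)

# Toy stubs (hypothetical): a perfect recognizer, one word per second.
words = ["w%d" % k for k in range(8)]

def recognize(t0, t1):
    # (word, start_time) pairs heard in [t0, t1)
    return [(words[k], k) for k in range(int(t0), int(t1))]

def find_anchor(hyp, ws, lo, hi):
    # trust the middle hypothesized word as a one-word anchor
    if not hyp:
        return None
    w, t = hyp[len(hyp) // 2]
    i = ws.index(w, lo, hi)
    return (i, i + 1, t, t + 1)

anchors = []
recursive_align(words, 0, 8, 0, 8, recognize, find_anchor, anchors)
```

With these stubs every word ends up anchored at its true time; in the real system the anchor search is a text match between hypothesis and transcript, and the recognizer is rerun on each unaligned part.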
     Building on this alignment algorithm, and combining it with audio segmentation techniques such as endpoint detection, speech detection, and speaker segmentation, an automatic labeling system for broadcast speech is built. The system automatically annotates broadcast speech with three layers of information: a content summary layer, a speaker identity layer, and a speech content layer. Labeling completeness reaches 89.2%, and 98.9% of the sentences deviate from the true boundaries by less than one second. This greatly reduces the manual labeling workload and provides reliable auxiliary information for subsequent manual correction.
     In addition, to improve the performance and precision of automatic labeling, this thesis also studies audio segmentation techniques, including endpoint detection, speech detection, and speaker segmentation, and evaluates them experimentally on broadcast news data.
With the development of computer, network, and communication technology, large amounts of audio and video content have become available over the Internet. Mining these data sources, for example to support search through audio indexing engines, has become an interesting area of research, one that needs large-vocabulary speech recognition systems with improved performance. As a typical kind of multimedia data, broadcast news is the material most studied by speech researchers.
     Large amounts of speech with high-quality transcripts are indispensable for training the acoustic models of automatic speech recognition (ASR) systems, and the transcripts are typically produced by humans. However, today's recognition systems are trained on hundreds or even thousands of hours of speech, and the continuing growth of training corpora may make high-quality manual labeling prohibitively expensive.
     Fortunately, in many cases both the audio data and the corresponding transcripts are available on the Internet. The problem is then not to recognize words from the audio but to align the given text with the audio. This thesis addresses the problem of aligning long speech recordings with their transcripts.
     We present a recursive technique based on speech recognition and dynamic programming. The algorithm can be described as a recursive speech recognition process with a gradually restricted dictionary and language model extracted from the corresponding transcript. The approach rests on discovering islands of confidence, called anchors, found via dynamic programming by aligning the correct transcript with the hypothesized words from speech recognition. The transcript segment and the audio segment are partitioned into aligned and unaligned segments according to the anchors, and these steps are repeated on the unaligned segments until a termination condition is reached. We improve the recursive approach for broadcast news in three main ways. First, our approach explicitly aligns in units of sentences, which makes manual correction easier. Second, it uses the dynamic-programming technique of Dynamic Time Warping (DTW) to align the correct transcript with the hypothesis text; DTW's tolerance of a few errors prevents errors in the transcript from making the alignment fail. Third, for segments in which no anchor can be found, we apply acoustic model re-estimation rather than changing the dynamic-programming threshold during recursion, because some segments of broadcast news audio are simply too poor to recognize.
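The anchor search can be made concrete with a small sketch (assumed simplifications, not the thesis code): a standard edit-distance DP, a discrete analogue of the DTW step described above, aligns the hypothesis with the reference transcript, and runs of at least `min_run` consecutive exact word matches are kept as trusted anchors, so a few recognition or transcript errors merely shorten or split runs instead of breaking the alignment.

```python
def dtw_align(hyp, ref):
    """Edit-distance DP between two word sequences; returns the list of
    (hyp_index, ref_index) pairs that match exactly on the optimal path."""
    n, m = len(hyp), len(ref)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i
    for j in range(1, m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            D[i][j] = min(D[i - 1][j - 1] + cost,
                          D[i - 1][j] + 1, D[i][j - 1] + 1)
    # Backtrace, collecting the exactly matching positions.
    i, j, pairs = n, m, []
    while i > 0 and j > 0:
        if hyp[i - 1] == ref[j - 1] and D[i][j] == D[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif D[i][j] == D[i - 1][j - 1] + 1:
            i, j = i - 1, j - 1      # substitution
        elif D[i][j] == D[i - 1][j] + 1:
            i -= 1                   # word only in the hypothesis
        else:
            j -= 1                   # word only in the reference
    return pairs[::-1]

def find_anchors(pairs, min_run=3):
    """Keep runs of >= min_run consecutive matches as trusted anchors."""
    anchors, run = [], []
    for p in pairs:
        if run and (p[0] != run[-1][0] + 1 or p[1] != run[-1][1] + 1):
            if len(run) >= min_run:
                anchors.append((run[0], run[-1]))
            run = []
        run.append(p)
    if len(run) >= min_run:
        anchors.append((run[0], run[-1]))
    return anchors

# Example: one misrecognized word ("XX") and one word missing from hyp.
hyp = "the news at nine XX by john".split()
ref = "the news at nine read by john smith".split()
pairs = dtw_align(hyp, ref)
anchors = find_anchors(pairs, min_run=3)
```

Here only the four-word run at the start is long enough to be trusted; the short run after the errors is discarded, which mirrors how the system only commits to high-confidence regions.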
     Based on the recursive alignment algorithm and on speech segmentation, including voice activity detection, speech/music discrimination, and speaker segmentation, we build an automatic labeling system for broadcast news. It performs automatic labeling on three aspects: content summary, speaker identity, and speech content. The completeness ratio is 89.2%, and 98.9% of the sentences are off by less than one second from the true tags. The system saves a large amount of human labor.
     To improve the performance of the labeling system, this thesis also studies speech segmentation techniques and evaluates them on broadcast news audio.
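As a flavor of the endpoint detection studied here, below is a minimal energy-threshold sketch (a deliberately simplified scheme under assumed parameters, not the algorithm evaluated in the thesis): frames whose short-time energy exceeds a multiple of the quietest frame's energy are marked as speech and merged into segments.

```python
import math

def detect_endpoints(signal, size=160, hop=80, ratio=4.0):
    """Mark frames whose short-time energy exceeds ratio * min-frame-energy
    as speech, and return merged (start_frame, end_frame) segments."""
    frames = [signal[i:i + size]
              for i in range(0, len(signal) - size + 1, hop)]
    energy = [sum(s * s for s in f) / size for f in frames]
    threshold = ratio * (min(energy) + 1e-9)
    flags = [e > threshold for e in energy]
    segments, start = [], None
    for k, speech in enumerate(flags):
        if speech and start is None:
            start = k
        elif not speech and start is not None:
            segments.append((start, k))
            start = None
    if start is not None:
        segments.append((start, len(flags)))
    return segments

# Demo: 800 samples of faint noise, 800 of a loud tone, 800 of faint noise.
# Frame length 160 and hop 80 correspond to 20 ms / 10 ms at 8 kHz.
quiet = [0.01 * (-1) ** k for k in range(800)]
tone = [math.sin(2 * math.pi * k / 16) for k in range(800)]
segments = detect_endpoints(quiet + tone + quiet)
```

Real detectors add zero-crossing rate, hangover smoothing, and adaptive noise estimates on top of this basic energy test.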