Research on Label-Minimized Audio Classification and Sentence Segmentation
Abstract
Automatic construction of a voice database plays an important role in trainable speech synthesis. It requires classifying the input audio so that each category can be processed appropriately, and then segmenting the processed audio into sentences, which serve as the input of the subsequent automatic phonetic-segment cutting system. Audio classification and sentence segmentation are the key technologies for solving these problems. However, existing audio classification and sentence segmentation methods require large quantities of manually labeled data to train models and test results; manual labeling is expensive, time-consuming, and laborious, and greatly increases the cost of system construction. Against this background, research on label-minimized audio classification and sentence segmentation has high theoretical and practical value. This thesis therefore studies content-based audio classification and sentence segmentation without speech recognition in depth and systematically, covering feature selection, label minimization, improvements to key techniques, and related applications. The detailed research work and results are as follows.
     (1) The main sources of audio information and the semantic content of audio are analyzed in depth. Based on the characteristics of the news-reading audio used in this work, audio clips are classified into three classes: pure speech, pure music, and speech mixed with music. The discriminative features of the different audio classes are studied at both the frame level and the clip level: besides basic features such as frequency-domain energy, zero-crossing rate (ZCR), and MFCCs, new features are introduced, namely the silence ratio, the high-ZCR ratio, and the low-frequency-energy ratio (sketched below). The first innovation of this thesis is that, after an in-depth analysis of the advantage of the co-training algorithm in minimizing the amount of labeled data while guaranteeing classification accuracy, a co-training algorithm based on maximum entropy (Maxent) classification is applied to audio classification. Experimental results demonstrate the performance of co-training on this task.
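The following is a minimal sketch of how the clip-level ratio features named above could be computed from frame-level measurements; it is illustrative rather than the thesis's implementation, and the frame parameters, the 1 kHz low-frequency cutoff, and all thresholds are assumptions.

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Split a mono 16 kHz signal into 25 ms frames with a 10 ms hop."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n)])

def clip_level_features(x, sr=16000):
    frames = frame_signal(x)
    energy = np.mean(frames ** 2, axis=1)  # short-time energy per frame
    # Zero-crossing rate: fraction of sign changes between adjacent samples.
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)

    # Fraction of each frame's spectral energy below 1 kHz (assumed cutoff).
    spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    freqs = np.fft.rfftfreq(frames.shape[1], d=1.0 / sr)
    low_frac = spec[:, freqs < 1000].sum(axis=1) / (spec.sum(axis=1) + 1e-12)

    # Clip-level ratios: share of frames past a threshold tied to the clip
    # mean; the 0.5x / 1.5x / 0.9 factors are illustrative assumptions.
    silence_ratio = float(np.mean(energy < 0.5 * energy.mean()))
    high_zcr_ratio = float(np.mean(zcr > 1.5 * zcr.mean()))
    low_freq_energy_ratio = float(np.mean(low_frac > 0.9))
    return np.array([silence_ratio, high_zcr_ratio, low_freq_energy_ratio])
```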
     (2) To realize label minimization, the co-training algorithm based on the maximum entropy (Maxent) classifier is studied in detail. Co-training is the core of label minimization: by comparing the effect of different parameter settings on classification accuracy and weighing the time and computation costs, the best-performing parameter set is determined. Meanwhile, the classification scheme of the Maxent classifier is adapted to the numerical classification used in audio classification and sentence segmentation. Experimental results prove the performance of co-training in binary classification and in minimizing the amount of labeled data required, which provides a solid foundation for the label-minimized audio classification and sentence segmentation systems. A sketch of the co-training loop follows.
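As a minimal sketch of the standard two-view co-training loop, the code below trains one classifier per feature view and, in each round, lets each view add its most confidently self-labeled examples to the shared labeled pool. scikit-learn's LogisticRegression stands in for a Maxent classifier; the round count and per-round growth are assumptions, not the parameter values the thesis determined.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression  # stand-in for Maxent

def co_train(Xa, Xb, seed_labels, labeled, unlabeled, rounds=10, per_round=5):
    """Co-training over two feature views Xa and Xb (numpy arrays).

    labeled/unlabeled are index lists into the rows of Xa/Xb; seed_labels
    holds the labels of the initially labeled indices.
    """
    labeled, unlabeled = list(labeled), list(unlabeled)
    y = dict(zip(labeled, seed_labels))
    clf_a = LogisticRegression(max_iter=1000)
    clf_b = LogisticRegression(max_iter=1000)

    for _ in range(rounds):
        if not unlabeled:
            break
        clf_a.fit(Xa[labeled], [y[i] for i in labeled])
        clf_b.fit(Xb[labeled], [y[i] for i in labeled])
        for clf, X in ((clf_a, Xa), (clf_b, Xb)):
            if not unlabeled:
                break
            proba = clf.predict_proba(X[unlabeled])
            top = np.argsort(-proba.max(axis=1))[:per_round]
            moved = [unlabeled[j] for j in top]
            # This view pseudo-labels its most confident examples and moves
            # them into the labeled pool used by both views.
            for j, idx in zip(top, moved):
                y[idx] = clf.classes_[proba[j].argmax()]
            labeled += moved
            unlabeled = [i for i in unlabeled if i not in moved]
    return clf_a, clf_b
```

At prediction time the two views' posteriors can be averaged or multiplied; for the binary decisions discussed above, the class with the larger combined probability is taken.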
     (3) Based on an in-depth analysis of the shortcomings of sentence segmentation methods that rely heavily on speech recognition results, and on a study of the important role prosodic features play in sentence segmentation, semantic sentence segmentation is performed on the audio by classifying it at the frame level into vowel/consonant/pause (V/C/P) and by using prosodic features together with pause features and rate of speech (ROS) as two feature sets. To make the sentence segmentation label-free, a label-data generation method with an error-checking mechanism, based on forced alignment and speech recognition, is introduced to provide labeled data automatically; Maxent-based co-training is then applied to overcome the effect of insufficient labeled data on classification accuracy, realizing sentence boundary detection without manual labels or speech recognition. Finally, to address the problem that a detected boundary cannot be confirmed as a true sentence boundary, a checking mechanism is proposed: it compares the proportion of vowels in the text with the corresponding proportion in the V/C/P-classified audio, so that the truly correct sentence boundaries can be picked out of those detected by co-training and used directly in subsequent processing and systems (a sketch follows). The second innovation of this thesis is the realization of label-free sentence segmentation and the checking mechanism for determining and extracting the true sentence boundaries.
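One plausible instantiation of this check, sketched below under assumed inputs, counts vowel segments (maximal runs of 'V' frames) on each side of a candidate boundary and accepts the boundary when that split matches the split of vowels in the transcript; the per-frame label alphabet and the tolerance are assumptions, not the thesis's stated values.

```python
def count_vowel_segments(frame_labels):
    """Count maximal runs of 'V' frames, i.e. distinct vowels in the audio."""
    count, prev = 0, None
    for lab in frame_labels:
        if lab == 'V' and prev != 'V':
            count += 1
        prev = lab
    return count

def vowel_ratio_check(frame_labels, boundary_frame,
                      text_vowels_left, text_vowels_right, tol=0.1):
    """Accept a candidate boundary if the left/right split of vowel segments
    in the V/C/P-classified audio matches the vowel split in the text."""
    a_left = count_vowel_segments(frame_labels[:boundary_frame])
    a_right = count_vowel_segments(frame_labels[boundary_frame:])
    if a_left + a_right == 0 or text_vowels_left + text_vowels_right == 0:
        return False
    audio_prop = a_left / (a_left + a_right)
    text_prop = text_vowels_left / (text_vowels_left + text_vowels_right)
    return abs(audio_prop - text_prop) <= tol
```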