Unsupervised accent classification for deep data fusion of accent and language information

Authors: John H.L. Hansen (corresponding author, john.hansen@utdallas.edu) ; Gang Liu
Keywords: NLP ; TF-IDF ; Accent classification ; Dialect identification ; UT-Podcast
Journal: Speech Communication
Publication year: 2016
Publication date: April 2016
Volume: 78
Issue: Complete
Pages: 19-33
Full-text size: 1335 K

Abstract

Automatic Dialect Identification (DID) has recently gained substantial interest in the speech processing community. Studies have shown that dialect-induced variation in speech significantly impacts speech system performance. Dialects differ in various ways, such as acoustic traits (phonetic realization of vowels and consonants, rhythmical characteristics, prosody) and content-based word selection (grammar, vocabulary, phonetic distribution, lexical distribution, semantics). The traditional DID classifier is usually based on Gaussian Mixture Modeling (GMM), which is employed here as the baseline system. We investigate various methods of improving DID based on acoustic and text language sub-systems to further boost performance. For the acoustic approach, we propose to use an i-Vector system. For text-language-based dialect classification, a series of natural language processing (NLP) techniques is explored to address word selection and grammar factors, which cannot be modeled using an acoustic modeling system. These NLP techniques include two traditional approaches, N-Gram modeling and Latent Semantic Analysis (LSA), and a novel approach based on Term Frequency–Inverse Document Frequency (TF-IDF) and logistic regression classification. Due to the sparsity of training data, the traditional text approaches do not offer superior performance. However, the proposed TF-IDF approach shows performance comparable to the i-Vector acoustic system, and fusing it with the i-Vector system results in a final audio-text combined solution that is more discriminative. Compared with the GMM baseline system, the proposed audio-text DID system provides a relative improvement in dialect classification performance of +40.1% and +47.1% on the self-collected corpus (UT-Podcast) and NIST LRE-2009 data, respectively. The experimental results validate the feasibility of leveraging both acoustic and textual information to achieve improved DID performance.
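
To make the text-based idea in the abstract concrete, the following is a minimal sketch of TF-IDF weighting followed by logistic regression, written in plain Python. It is not the authors' implementation: the toy transcripts, the labels, the smoothed IDF variant, the gradient-descent settings, and all function names (`fit_idf`, `tfidf`, `train_logreg`, `predict_proba`) are assumptions made purely for illustration.

```python
import math
from collections import Counter


def fit_idf(docs):
    """Learn a vocabulary and smoothed IDF weights from training transcripts."""
    token_lists = [d.lower().split() for d in docs]
    # Document frequency: in how many transcripts each word occurs.
    df = Counter(w for toks in token_lists for w in set(toks))
    n = len(docs)
    vocab = sorted(df)
    # Smoothed IDF; this particular variant is an assumption.
    idf = [math.log((1 + n) / (1 + df[w])) + 1.0 for w in vocab]
    return vocab, idf


def tfidf(doc, vocab, idf):
    """Map one transcript to an L2-normalised TF-IDF vector (unknown words are ignored)."""
    tf = Counter(doc.lower().split())
    vec = [tf[w] * weight for w, weight in zip(vocab, idf)]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(X, y, lr=1.0, epochs=300):
    """Binary logistic regression fitted by full-batch gradient descent."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for x, label in zip(X, y):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - label
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        w = [wi - lr * g / len(X) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def predict_proba(doc, vocab, idf, w, b):
    """Probability that a transcript belongs to the class labelled 1."""
    x = tfidf(doc, vocab, idf)
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


# Hypothetical toy transcripts: label 1 = dialect A, label 0 = dialect B.
train_docs = [
    "i parked the lorry outside my flat",
    "we stood in the queue near the flat",
    "i parked the truck outside my apartment",
    "we stood in the line near the apartment",
]
train_labels = [1, 1, 0, 0]

vocab, idf = fit_idf(train_docs)
X = [tfidf(d, vocab, idf) for d in train_docs]
w, b = train_logreg(X, train_labels)

p_a = predict_proba("the lorry is by the flat", vocab, idf, w, b)
p_b = predict_proba("the truck is by the apartment", vocab, idf, w, b)
```

In this toy setting the classifier can only separate the two classes through word choice (for example "lorry" versus "truck"), which is the kind of cue the abstract says an acoustic model cannot capture. How the text scores are fused with the i-Vector system is not described in the abstract, so it is left out here.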
