Fast query-by-example spoken term detection algorithm based on dynamic time warping

Original title (Chinese): 基于动态时间规整的语音样例快速检索算法
Authors: ZHANG Lian-hai (张连海); FENG Zhi-yuan (冯志远); CHEN Qi (陈琦); LI Bo-hao (李勃昊)
Affiliation: Institute of Information System Engineering, Information Engineering University (信息工程大学信息系统工程学院)
Keywords: query-by-example spoken term detection; phone posterior probability; piecewise aggregate approximation lower-bound estimate; dynamic time warping; inner-product distance
Journal code: JSYJ
Journal: Application Research of Computers
Publication date: 2014-04-15 23:43
Publisher: Application Research of Computers (计算机应用研究)
Year: 2014
Issue: v.31; No.272
Funding: National Natural Science Foundation of China (61175017)
Language: Chinese
Pages: JSYJ201406021
Page count: 5
CN: 06
ISSN: 51-1196/TP
Classification number: 94-98

Abstract

To speed up speech retrieval systems based on dynamic time warping (DTW), this paper proposes a fast query-by-example spoken term detection method built on a piecewise aggregate approximation (PAA) lower-bound estimate (LBE) for DTW. The method first extracts phone posterior probabilities from the query example and the test set as feature parameters. It then computes the PAA lower-bound estimate of the actual DTW score between the query example and every candidate segment in the test set. Finally, it combines the K-nearest neighbor (KNN) algorithm with DTW to search for the regions most similar to the query example. Experimental results show that detection with this method is 6.32 times as fast as applying DTW directly, with no effect on detection precision.
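The lower-bound-then-KNN search described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's PAA estimate: as a stand-in it uses a simpler per-frame envelope bound on the inner-product distance, which is also a valid lower bound on the banded DTW score. The equal-length sliding windows, the band width `r`, the floor `EPS`, and all function names (`ip_dist`, `dtw`, `lower_bound`, `knn_search`, `toy_posteriors`) are assumptions made for illustration only.

```python
import heapq
import math
import random

EPS = 1e-10  # floor that keeps log() finite (an assumption of this sketch)


def ip_dist(p, q):
    """Inner-product distance -log(p . q), clamped to be non-negative."""
    dot = sum(a * b for a, b in zip(p, q))
    return max(0.0, -math.log(max(dot, EPS)))


def dtw(query, seg, r):
    """DTW score of two equal-length sequences inside a Sakoe-Chiba band r."""
    m = len(query)
    prev = [math.inf] * (m + 1)
    prev[0] = 0.0
    for i in range(1, m + 1):
        cur = [math.inf] * (m + 1)
        for j in range(max(1, i - r), min(m, i + r) + 1):
            step = min(prev[j], cur[j - 1], prev[j - 1])
            cur[j] = ip_dist(query[i - 1], seg[j - 1]) + step
        prev = cur
    return prev[m]


def lower_bound(query, seg, r):
    """Envelope bound: frame i can only meet frames i-r..i+r of seg, and the
    element-wise maximum over those frames gives the largest possible inner
    product, hence the smallest possible distance for that query frame."""
    m, dim = len(query), len(query[0])
    total = 0.0
    for i in range(m):
        lo, hi = max(0, i - r), min(m - 1, i + r)
        upper = [max(seg[j][k] for j in range(lo, hi + 1)) for k in range(dim)]
        total += ip_dist(query[i], upper)
    return total


def knn_search(query, utterance, k, r):
    """Return the k best (score, start) windows and the number of DTW calls."""
    m = len(query)
    cands = sorted(
        (lower_bound(query, utterance[s:s + m], r), s)
        for s in range(len(utterance) - m + 1)
    )
    best = []  # max-heap of the k best scores so far, stored as (-score, start)
    n_dtw = 0
    for lb, s in cands:
        # Candidates are visited in order of increasing bound, so once a bound
        # reaches the current k-th best score no later window can improve it.
        if len(best) == k and lb >= -best[0][0]:
            break
        score = dtw(query, utterance[s:s + m], r)
        n_dtw += 1
        if len(best) < k:
            heapq.heappush(best, (-score, s))
        elif score < -best[0][0]:
            heapq.heapreplace(best, (-score, s))
    return sorted((-neg, s) for neg, s in best), n_dtw


def toy_posteriors(n, dim, seed):
    """Random normalized vectors standing in for phone posteriors (toy data)."""
    rng = random.Random(seed)
    frames = []
    for _ in range(n):
        raw = [rng.random() ** 4 + 1e-3 for _ in range(dim)]
        total = sum(raw)
        frames.append([x / total for x in raw])
    return frames


if __name__ == "__main__":
    utt = toy_posteriors(60, 5, seed=0)
    hits, n_dtw = knn_search(utt[20:28], utt, k=3, r=2)
    print(hits, "DTW calls:", n_dtw, "of", len(utt) - 8 + 1)
```

Because the bound never exceeds the true banded DTW score, the pruned search returns the same top-K scores as an exhaustive DTW pass while skipping the full alignment for many windows; how many it skips depends on how tight the bound is for the data at hand.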

References

[1]SHEN Wa-de,WHITE C M,HAZEN T J.A comparison of query-by-example methods for spoken term detection[C]//Proc of the 10th Annual Conference of the International Speech Communication Association.2009:2143-2146.
[2]CHELBA C,HAZEN T J,SARACLAR M.Retrieval and browsing of spoken content[J].IEEE Signal Processing Magazine,2008,25(3):39-49.
[3]TZANETAKIS G,ERMOLINSKY A,COOK P.Pitch histograms in audio and symbolic music information retrieval[J].Journal of New Music Research,2003,32(2):143-152.
[4]SARACLAR M,SPROAT R W.Lattice-based search for spoken utterance retrieval[C]//Proc of Human Language Technologies:The Annual Conference of the North American Chapter of the Association for Computational Linguistics.2004:129-136.
[5]MILLER D R H,KLEBER M,KAO C,et al.Rapid and accurate spoken term detection[C]//Proc of the 8th Annual Conference of the International Speech Communication Association.2007:314-317.
[6]NG K,ZUE V W.Subword-based approaches for spoken document retrieval[D].Cambridge:Massachusetts Institute of Technology,2000.
[7]YU Peng,CHEN Kai-jiang,MA Cheng-yuan,et al.Vocabulary-independent indexing of spontaneous speech[J].IEEE Trans on Speech and Audio Processing,2005,13(5):635-643.
[8]HAZEN T J,SHEN Wa-de,WHITE C.Query-by-example spoken term detection using phonetic posteriorgram templates[C]//Proc of IEEE Workshop on Automatic Speech Recognition and Understanding.Merano/Meran,Italy:ASRU,2009:421-426.
[9]TEJEDOR J,SZÖKE I,FAPŠO M.Novel methods for query selection and query combination in query-by-example spoken term detection[C]//Proc of International Workshop on Searching Spontaneous Conversational Speech.2010:15-20.
[10]KEOGH E.Exact indexing of dynamic time warping[C]//Proc of the 28th International Conference on Very Large Data Bases.2002:406-417.
[11]RATH T M,MANMATHA R.Lower-bounding of dynamic time warping distances for multivariate time series,Technical Report MM-40[R].[S.l.]:University of Massachusetts Amherst,2003.
[12]VLACHOS M,HADJIELEFTHERIOU M,GUNOPULOS D,et al.Indexing multi-dimensional time-series with support for multiple distance measures[C]//Proc of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.2003:216-225.
[13]ASAEI A,PICART B,BOURLARD H.Analysis of phone posterior feature space exploiting class specific sparsity and MLP based similarity measure[C]//Proc of IEEE International Conference on Acoustics,Speech,and Signal Processing.2010:4886-4889.
[14]ZHANG Yao-dong,GLASS J R.Towards multi-speaker unsupervised speech pattern discovery[C]//Proc of IEEE International Conference on Acoustics,Speech,and Signal Processing.2010:4366-4369.
[15]ZHANG Yao-dong,GLASS J R.An inner-product lower-bound estimate for dynamic time warping[C]//Proc of IEEE International Conference on Acoustics,Speech,and Signal Processing.2011:5660-5663.
[16]SCHWARZ P.Phoneme recognition based on long temporal context[D].Brno:Brno University of Technology,2008.
[17]STROM N.The NICO artificial neural network toolkit[EB/OL].http://nico.nikkostrom.com.
