Research on Context-Based Audio and Video Annotation
Abstract
With the rapid development of computer and network technologies, multimedia data such as audio and video are growing at an explosive rate. To make these massive data easier to manage and exploit, their content is commonly described at several levels, including low-level features, structural information, and semantic features. Among these, semantic features, as the description closest to human understanding, have attracted the most attention, and machine learning-based audio and video annotation, a fast and effective way to obtain such descriptions, has become an active research topic. However, owing to the well-known "semantic gap" between low-level multimedia features and high-level semantics, satisfactory annotation performance is hard to achieve by improving the learning algorithms alone. In this situation, making sensible use of the contextual cues embedded in the rich content of audio and video, such as semantic correlation, temporal correlation, and multi-modal correlation, helps narrow the semantic gap and thereby improves the accuracy of audio and video annotation.
     Taking context-based audio and video annotation as its starting point, this thesis discusses several key problems in existing annotation methods and studies in depth the mining, modeling, and exploitation of the three kinds of context listed above. The main contributions are as follows:
     (1) To address the underuse of semantic-correlation context in audio annotation, we propose an audio concept detection algorithm based on a Correlated-Aspect Gaussian Mixture Model and explore topic feedback-based keyword spotting. As semantic descriptions of audio-visual content, annotation units exhibit contextual relations such as co-occurrence and mutual constraint. Starting from generic audio on the one hand and a special kind of audio, speech, on the other, we discuss how this semantic-correlation context can be mined and exploited in audio annotation. For multi-label concept detection on generic audio, conventional methods ignore the correlation among semantic concepts; our algorithm embeds this correlation into the Gaussian Mixture Model framework to guide detection, so that concepts that are hard to detect are reinforced by those that are easy to detect, which improves detection accuracy (a model sketch follows below). For speech, starting from how speech is produced, we apply text categorization-based topic modeling to the speaker's original intention and use the resulting topics as high-level semantic context to remove false alarms from the initial keyword spotting results; the approach is validated in a spoken document retrieval application.
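     As an illustrative sketch only, in our own notation rather than the thesis's exact formulation: if concepts c play the role of aspects, the feature distribution of an audio clip d can be written as

\[ p(x \mid d) = \sum_{c} p(c \mid d)\, p(x \mid c), \qquad p(x \mid c) = \sum_{m=1}^{M} \pi_{c,m}\, \mathcal{N}\!\left(x;\, \mu_{c,m}, \Sigma_{c,m}\right), \]

where each concept owns a Gaussian mixture over low-level features, and the clip-level weights p(c | d) are estimated jointly across concepts rather than independently, so that co-occurrence statistics (for example, a confident "applause" detection raising the prior of "cheering") can lift the posterior of concepts that are hard to detect on their own.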
     (2) We analyze the limitations of the generic concept correlation commonly used in video annotation and propose a data-specific two-view concept correlation estimation algorithm. Concept correlation, as part of the semantic-correlation context, plays a guiding role in the annotation process, but the generic correlation usually adopted cannot correctly describe the concept distribution of each individual item to be processed, so video annotation guided by it often falls short of expectations. To address this, we estimate the spatial and temporal concept correlations implied by each specific shot and shot pair, casting the estimation as a problem of data decomposition and reconstruction, as sketched below. Within a probability-based video annotation refinement scheme, experiments on the TRECVID 2006-2008 datasets and comparisons with other methods show that the estimated correlations reflect the semantic content of the data themselves and therefore refine the initial results of individual concept detectors more effectively.
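     One way to read the decomposition-and-reconstruction view, under assumed notation (the symbols below are illustrative, not the thesis's): let y be the initial detection-score vector of a specific shot, let the columns of D collect the score vectors of training shots, and let the columns of C hold the corresponding ground-truth concept vectors. Decomposing y over D and reconstructing with C gives a shot-specific estimate:

\[ \hat{\alpha} = \arg\min_{\alpha}\; \| y - D\alpha \|_2^2 + \lambda \|\alpha\|_1, \qquad \hat{y} = C\hat{\alpha}. \]

The correlation used to refine a shot is thus derived from training data that resemble that shot, instead of from a single generic co-occurrence matrix; the temporal view would apply the same construction to concatenated score vectors of shot pairs.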
     (3) Starting from the modeling of video temporal consistency, we propose a graph-regularized continuous probabilistic latent semantic analysis model, GRGM-pLSA, together with a feature conversion algorithm for video concept detection. The temporal nature of video implies that temporally consecutive segments tend to have similar visual and semantic content. Building on this temporal-consistency context, GRGM-pLSA uses graph-based manifold regularization to model the term interdependence neglected by the original continuous pLSA with Gaussian mixtures (GM-pLSA); a sketch of the regularized objective follows below. In video annotation, the model serves not only for feature mapping but also as a generative model, and the resulting visual-to-textual feature conversion algorithm exploits the contextual information implied by video structure, overcoming the limitations of pLSA-based probabilistic annotation when applied to video. Experiments on YouTube and TRECVID datasets demonstrate the effectiveness of both the model and the feature conversion algorithm.
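     In spirit, and sketched here in our own notation following graph-regularized topic models in general rather than the thesis's exact objective, GRGM-pLSA augments the GM-pLSA log-likelihood with a manifold-regularization penalty built on a graph W that links temporally adjacent terms:

\[ \mathcal{O} = \sum_{d} \sum_{x \in d} \log \sum_{z} p(z \mid d)\, p(x \mid z) \;-\; \frac{\lambda}{2} \sum_{i,j} W_{ij}\, \big\| p(z \mid x_i) - p(z \mid x_j) \big\|_2^2, \]

where each p(x | z) is a Gaussian mixture over continuous features. The penalty pushes terms connected in the graph, here consecutive video segments, toward similar topic assignments, which is precisely the temporal-consistency assumption stated above; λ trades data fit against smoothness.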
     (4) To make effective use of multi-modal correlation context, we propose a multi-modal continuous probabilistic latent semantic analysis model, MMGM-pLSA, and its generalized form, graph-regularized MMGM-pLSA (GRMMGM-pLSA). The audio, visual, and other modality features describing the same video segment are correlated and complementary, so a reasonable fusion scheme should both preserve the characteristics of each modality and maintain the interdependence among them. With this as the starting point, MMGM-pLSA casts multi-modal fusion as the modeling of multi-modal terms within the continuous pLSA framework, assigning each modality its own Gaussian mixture to describe its feature distribution, and performs audio-visual fusion effectively in classification-based video annotation; a sketch of the shared-topic fusion idea follows below. Building on it, GRMMGM-pLSA additionally models the intrinsic correlation among multi-modal terms; as a generalization of GM-pLSA as well as our MMGM-pLSA and GRGM-pLSA, it models the multi-modal and temporal-consistency contexts of video simultaneously.
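     A sketch of the shared-topic fusion idea, again with illustrative notation: conditioned on a latent topic z, the modalities are generated independently, each by its own Gaussian mixture,

\[ p(x^{a}, x^{v} \mid d) = \sum_{z} p(z \mid d)\; p(x^{a} \mid z)\; p(x^{v} \mid z), \]

so the topic variable carries the cross-modal correlation while p(x^a | z) and p(x^v | z) keep the per-modality feature distributions separate. Adding the graph regularizer of GRGM-pLSA to this likelihood yields GRMMGM-pLSA, which reduces to GM-pLSA, MMGM-pLSA, or GRGM-pLSA when the extra modalities or the regularizer are dropped.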