Research on Fast Classification Algorithms for Violent Videos Based on Bag-of-Audio-Words and MPEG-7 Features
Abstract
With the popularity of online video, the Internet hosts videos of every kind. In recent years computer vision has attracted increasing attention: by analyzing the binary data of a video, a computer can determine the category it belongs to. Traditional content-based video classification comprises two parts, visual and audio feature extraction. Visual features are mainly global image features such as color, texture, and shape; comparing the similarity of these features allows images matching a user's requirements to be retrieved automatically. Audio features are extracted from the audio stream, e.g. pitch frequency, bandwidth, spectral flux, Mel-frequency cepstral coefficients, and sound power. After a classifier is trained on these visual and audio features, video categories can be recognized fairly accurately.
     On the other hand, the network is flooded with unhealthy videos of various kinds; horror and violent videos in particular can do considerable harm to children's development, so such videos need to be labeled and supervised. Demand for regulation of online video has grown in recent years. To address this need, this thesis proposes two classification methods for violent videos.
     This thesis introduces a "bag of audio words" feature that combines MPEG-7 audio descriptors with the bag-of-words model. First, the audio stream is extracted from an online video and its MPEG-7 audio features are computed. By classifying and clustering the AudioSignature descriptors, "audio words" specific to violent scenes are constructed, and a dedicated weighting scheme yields the new bag-of-audio-words feature. Experiments show that the method achieves good recall and can be applied to real-time monitoring of online video.
     This thesis also combines visual and audio features to propose two filtering models tailored to violent video: a structure tensor filtering model and a fast audio filtering model. The structure tensor model filters the video with a structure tensor feature (a motion detection feature) to obtain shots with intense motion, then applies face detection and audio scene matching. The fast audio model first extracts audio features to match common violent scenes, then classifies the resulting candidate shots precisely with visual features. Experiments show that the fast audio model classifies faster than the structure tensor model, while the structure tensor model is more accurate. Both can be applied effectively to filtering violent videos on the network.
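The structure tensor filtering step described above can be sketched minimally. The thesis uses a spatio-temporal tensor to detect intense motion; this illustrative sketch computes only the spatial structure tensor of a single frame on synthetic data, using the tensor trace as a simple gradient-energy score. The `sigma` value and the use of scipy are assumptions of the sketch, not details from the thesis.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, sobel

def structure_tensor_energy(frame, sigma=1.5):
    """Smoothed 2-D structure tensor J = G_sigma * (grad I grad I^T);
    the mean trace of J serves as a simple energy score, high on
    textured or rapidly changing regions."""
    ix = sobel(frame, axis=1)           # horizontal gradient
    iy = sobel(frame, axis=0)           # vertical gradient
    jxx = gaussian_filter(ix * ix, sigma)
    jyy = gaussian_filter(iy * iy, sigma)
    return (jxx + jyy).mean()

# Synthetic frames: a flat frame has zero energy, a textured one does not.
rng = np.random.default_rng(1)
flat = np.zeros((64, 64))
textured = rng.normal(size=(64, 64))
print(structure_tensor_energy(flat) < structure_tensor_energy(textured))  # True
```

In the thesis the tensor additionally spans the temporal axis, so shots whose tensor energy stays high across frames are kept as motion-intense candidates for the later face-detection and audio-matching stages.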
With the flourishing of the movie industry and the development of multimedia, many types of movies are available on the Internet. A human viewer can easily tell different genres apart after watching them; for a computer, however, automatically recognizing the theme of a movie is a complicated task. In recent years, more and more attention has been paid to computer vision. A computer can distinguish video types by comparing binary data through video and audio features. Traditional content-based video classification comprises two parts, audio features and visual features. Visual features include color, texture, and motion, while audio features are mainly low-level descriptors such as bandwidth, frequency, and Mel-frequency cepstral coefficients.
     On the other hand, some films contain violent and horror scenes that are unsuitable for children to watch. Governments now pay more attention to regulating video on the network. For this reason, two methods of classifying violent videos are presented in this thesis.
     We first introduce a new method of identifying violent videos with a bag of audio words. MPEG-7 audio descriptors are extracted first, including low-level features such as AudioSpectrumCentroid and AudioSpectrumSpread. Audio words are then built from the MPEG-7 high-level descriptor AudioSignature, which can be regarded as the "fingerprint" of the audio stream. A support vector machine classifies the feature vectors into two classes, violent and non-violent videos. Experimental results demonstrate that the method achieves good recall.
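The bag-of-audio-words pipeline can be sketched as follows. This is a minimal sketch, not the thesis implementation: synthetic frame descriptors stand in for the MPEG-7 AudioSignature features, plain k-means stands in for the codebook construction, a normalized histogram replaces the thesis's dedicated weighting scheme, and scikit-learn's `SVC` plays the role of the SVM classifier.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def build_vocabulary(frame_features, n_words=8):
    """Cluster frame-level audio descriptors into an 'audio word' codebook."""
    return KMeans(n_clusters=n_words, n_init=10, random_state=0).fit(frame_features)

def bag_of_audio_words(clip_frames, codebook):
    """Quantize a clip's frames against the codebook and return a
    normalized word histogram (the bag-of-audio-words vector)."""
    words = codebook.predict(clip_frames)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

# Synthetic stand-in data: 'violent' clips draw frames from a shifted
# distribution, mimicking distinctive spectral descriptors.
def make_clip(violent):
    center = 3.0 if violent else 0.0
    return rng.normal(center, 1.0, size=(40, 6))  # 40 frames x 6 descriptors

clips = [make_clip(v) for v in [True] * 20 + [False] * 20]
labels = np.array([1] * 20 + [0] * 20)

codebook = build_vocabulary(np.vstack(clips))
X = np.array([bag_of_audio_words(c, codebook) for c in clips])

clf = SVC(kernel="rbf").fit(X, labels)
print("train accuracy:", clf.score(X, labels))
```

The fixed-length histogram is what makes variable-length audio streams comparable by the SVM, which is the essential point of the bag-of-words step.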
     Combined with visual features, two filtering models are then introduced: a visual structure tensor filtering model and a fast audio filtering model. In the structure tensor model, we first extract structure tensor features and then classify the candidate shots by face detection and violent audio event detection. In the fast audio model, we extract audio features first and classify the candidate shots by visual features. Experimental results show that the structure tensor model achieves higher classification accuracy, while the audio model runs faster. Both models can be applied to violent video filtering on the Internet.
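The fast audio filtering model is a two-stage cascade, which can be sketched as below. The predicates and thresholds here are hypothetical placeholders for the thesis's audio scene matcher and visual classifier; the point of the structure is that the cheap audio check prunes most shots before the costlier visual stage runs.

```python
def fast_audio_filter(shots, audio_match, visual_classify):
    """Two-stage cascade: a cheap audio matcher prunes shots first; the
    costlier visual classifier runs only on the surviving candidates."""
    candidates = [s for s in shots if audio_match(s)]
    return [s for s in candidates if visual_classify(s)]

# Toy shots as (audio_energy, motion_score) pairs; thresholds are illustrative.
shots = [(0.9, 0.8), (0.2, 0.9), (0.95, 0.1), (0.85, 0.7)]
violent = fast_audio_filter(
    shots,
    audio_match=lambda s: s[0] > 0.5,      # stand-in for violent-audio matching
    visual_classify=lambda s: s[1] > 0.5,  # stand-in for visual classification
)
print(violent)  # [(0.9, 0.8), (0.85, 0.7)]
```

Swapping the order of the two stages (visual filter first) yields the structure tensor model's pipeline, which explains the speed/accuracy trade-off reported above: the stage that runs on every shot dominates the cost, while the more discriminative stage determines the final accuracy.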
