Research on Key-Frame-Based Video Content Description Methods
Abstract
With the rapid development of multimedia technology, how to describe video content has become one of the hot topics in research, and description methods that incorporate the human visual system (HVS) are attracting growing attention from researchers, with applications in video retrieval, intelligent surveillance, video compression, video copy detection, and related fields. At the same time, as intelligent surveillance systems are deployed ever more widely, the description of video content plays an increasingly prominent role in event detection for surveillance video. Accurate description of video content, and its application to event detection, has therefore become an active research topic both in China and abroad.
    Building on a spatiotemporal attention model that conforms to the human visual attention mechanism, this thesis applies the visual attention shift mechanism to propose a key-frame-based method for describing video content, applies the method to event detection in intelligent surveillance video, and additionally studies face detection and tracking algorithms. The thesis first reviews the basic methods and current state of research on video content description; it then covers the fundamentals of visual attention models and details the construction of a new spatiotemporal attention model; on that basis, key frames that characterize the video content are extracted according to the visual attention shift mechanism; the key-frame-based description method is then applied to event detection; finally, face detection and tracking algorithms are studied so that face information can later be used as a higher-level semantic feature in constructing the attention model.
    The main innovations and contributions of this thesis are as follows:
    (1) A new spatiotemporal attention model is constructed. Building on our laboratory's previous results, the model incorporates the temporal information of the video and fuses the temporal and spatial attention models with a weight dominated by temporal attention, yielding a spatiotemporal attention model that conforms to the human visual attention mechanism.
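    To make the fusion concrete, below is a minimal Python sketch of a temporally weighted fusion of two saliency maps. The frame-difference temporal cue, the contrast-based spatial cue, and the exact form of the weight `w_t` are illustrative stand-ins, not the thesis's actual model.

```python
import numpy as np

def temporal_saliency(prev_gray, cur_gray):
    # Motion cue from absolute frame differencing, normalized to [0, 1].
    diff = np.abs(cur_gray.astype(np.float32) - prev_gray.astype(np.float32))
    return diff / (diff.max() + 1e-8)

def spatial_saliency(gray):
    # Contrast cue: deviation of each pixel from the mean intensity.
    g = gray.astype(np.float32)
    sal = np.abs(g - g.mean())
    return sal / (sal.max() + 1e-8)

def spatiotemporal_saliency(prev_gray, cur_gray):
    # Fuse with a weight dominated by temporal attention: the stronger
    # the overall motion, the more the temporal map counts (always >= 0.5).
    s_t = temporal_saliency(prev_gray, cur_gray)
    s_s = spatial_saliency(cur_gray)
    w_t = 0.5 + 0.5 * float(s_t.mean())
    return w_t * s_t + (1.0 - w_t) * s_s
```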
    (2) A visual attention shift-based event detection algorithm is proposed. Starting from the characteristics of human visual attention, the algorithm treats shifts of visual attention as the evidence of events. The attended regions of each frame are extracted with the spatiotemporal attention model; shifts of the viewer's attention are identified from changes in the most attended region across consecutive frames, forming a visual attention rhythm; key frames are selected according to the strength of changes in this rhythm and mark the moments at which events occur, triggering alerts for attended events. The attended region of a key frame is then taken as the target and tracked with a mean-shift-based algorithm, which locates the attended object in the preceding and following frames and marks and highlights abandoned and removed objects, helping to contain dangerous situations before they escalate.
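    A minimal sketch of the attention-rhythm idea follows, assuming per-frame saliency maps such as those produced above. Taking the saliency peak as the attended point and thresholding shifts against the mean rhythm are simplifications of the thesis's selection rule; tracking the attended object (e.g. with OpenCV's cv2.meanShift) would then start from the selected key frames.

```python
import numpy as np

def most_attended_point(sal):
    # Centre of the most attended region: location of the saliency peak.
    y, x = np.unravel_index(np.argmax(sal), sal.shape)
    return np.array([x, y], dtype=np.float32)

def attention_rhythm(saliency_maps):
    # Magnitude of the attention shift between consecutive saliency peaks.
    pts = [most_attended_point(s) for s in saliency_maps]
    return [float(np.linalg.norm(b - a)) for a, b in zip(pts, pts[1:])]

def select_key_frames(saliency_maps, factor=2.0):
    # Keep frames whose shift clearly exceeds the average rhythm;
    # these mark candidate moments at which an event occurs.
    rhythm = attention_rhythm(saliency_maps)
    if not rhythm:
        return []
    thresh = factor * (sum(rhythm) / len(rhythm))
    return [i + 1 for i, r in enumerate(rhythm) if r > thresh]
```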
    (3) A face detection and tracking algorithm based on AdaBoost and CAMSHIFT is proposed. Building on existing detection and tracking algorithms, it improves the tracking stage: an accumulated histogram serves as the tracking evidence, and the position and size of the search window are adjusted continuously during tracking. This largely resolves the problems of skin color resembling the background and of the face region changing scale as the subject's distance varies.
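    A compact OpenCV sketch of this detect-then-track pipeline: Haar-cascade (AdaBoost) detection seeds the window, and CAMSHIFT re-estimates both its position and size each frame from a back-projection of an accumulated hue histogram. Simply summing per-frame ROI histograms is an assumed stand-in for the thesis's accumulated-histogram scheme.

```python
import cv2
import numpy as np

def detect_face(frame, cascade):
    # AdaBoost-based face detection via OpenCV's Haar cascade.
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    return tuple(int(v) for v in faces[0]) if len(faces) else None

def track_face(video_path):
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    cap = cv2.VideoCapture(video_path)
    crit = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1.0)
    hist = np.zeros((16, 1), np.float32)  # accumulated hue histogram
    window = None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        if window is None:
            window = detect_face(frame, cascade)  # (re)initialize
            if window is None:
                continue
        x, y, w, h = window
        if w == 0 or h == 0:  # lost the target: re-detect next frame
            window = None
            continue
        # Accumulate the hue histogram of the current face ROI so the
        # tracking evidence is less sensitive to skin-like backgrounds.
        hist += cv2.calcHist([hsv[y:y + h, x:x + w]], [0], None, [16], [0, 180])
        model = cv2.normalize(hist, None, 0, 255, cv2.NORM_MINMAX)
        back = cv2.calcBackProject([hsv], [0], model, [0, 180], 1)
        # CAMSHIFT adapts the window position and size to the face scale.
        _, window = cv2.CamShift(back, window, crit)
    cap.release()
```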
