Research on Image and Video Coding Based on Content Analysis (基于内容分析的图像视频编码研究)

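English title: The Content-analysis Based Image and Video Coding
Author: Shi Zhongbo (石中博)
Thesis level: Doctoral
Discipline: Signal and Information Processing
Chinese keywords (translated): image compression ; scalable video coding ; image set coding ; cloud storage ; visual content analysis ; visual pattern ; local feature ; image feature coding ; global similarity ; image rectification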
English keywords: image compression ; scalable video coding ; image set compression ; cloud storage ; visual content analysis ; learning based visual pattern ; local feature ; feature coding ; global similarity ; image alignment
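Degree year: 2014
Advisors: Wu Feng (吴枫) ; Li Houqiang (李厚强)
Discipline code: 081002
Degree-granting institution: University of Science and Technology of China (中国科学技术大学)
Thesis submission date: 2014-05-06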

Abstract

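Compression of digital images and video has been studied for more than twenty years with great success. After years of development, however, the traditional coding framework built on prediction and transform is approaching its performance limit. It is therefore necessary to analyze and understand the visual content of digital images from a new angle and to develop new compression methods. The rapid progress of computer vision over the past decade suggests starting from visual content analysis: exploiting the visual correlations in images to improve image and video coding performance.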
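     In this thesis, we focus on combining visual content analysis with image and video coding. We use visual content analysis to characterize the visual correlations between images and to remove visual redundancy at different levels of images and video, thereby improving coding efficiency. The main contributions of the thesis fall into three parts.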
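     In the first part, we propose an image coding method based on visual pattern analysis. The method uses prior visual patterns to describe the visual correlation between the low and high frequencies of an image, and on that basis adaptively discards some high-frequency visual redundancy at the encoder to improve coding performance. At the decoder, the prior information carried by the visual patterns is used to estimate and restore the discarded high-frequency detail, improving reconstruction quality. We further extend visual pattern analysis to scalable video coding and propose a new inter-layer prediction method based on it. By searching and mapping visual patterns, the method exploits the visual correlations of a scalable video sequence in both the temporal and the spatial domain to generate two high-quality inter-layer prediction signals, improving scalable video coding performance. In addition, we adopt a prediction method based on parameter analysis, which analyzes information already coded in the base layer (for example, the quadtree information in HEVC coding) to realize inter-layer prediction at lower complexity. By combining several content analysis mechanisms and providing both multi-loop and single-loop system implementations, our method achieves a better balance between coding performance and complexity.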
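     In the second part, we propose an efficient image coding method based on image features. The method establishes tighter local visual links between images through local feature matching and, together with pixel-level correlation analysis, removes visual redundancy more effectively. Specifically, using a multi-scale wavelet transform and SIFT feature extraction, we first decompose the input image into global information and in-subband local information and compress both. The global information is a basic description of the input image and contains limited visual redundancy; the in-subband local information consists of the SIFT local features extracted from the different wavelet subbands. At the decoder, we use the decoded SIFT features to retrieve a group of visually similar image patches from a cloud image database. Then, combining visual-pattern-based analysis and mapping, we fuse the information in these patches with the decoded global information to reconstruct the target image. Following the visual links established by the in-subband SIFT features, the visual-pattern-based mapping starts from the lowest-frequency subband and proceeds from low to high frequency, fusing the information from the similar patches into the corresponding subband and restoring the local detail of each frequency band until the image is fully reconstructed. By combining the strengths of local feature analysis and visual pattern analysis, our method achieves more efficient image coding.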
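The coarse-to-fine subband structure described above can be illustrated with a minimal sketch. It assumes a Haar wavelet (the abstract does not name the wavelet actually used) and images given as lists of rows with even dimensions; the function names are hypothetical. SIFT extraction, cloud retrieval, and patch fusion are not shown: the sketch only shows how an image splits into a low-frequency "global" band plus per-level detail subbands, and how reconstruction proceeds from the lowest-frequency level to the highest.

```python
def haar_level(img):
    """One level of a 2-D Haar transform. Returns (LL, LH, HL, HH), each half size."""
    h, w = len(img) // 2, len(img[0]) // 2
    ll = [[0.0] * w for _ in range(h)]
    lh = [[0.0] * w for _ in range(h)]
    hl = [[0.0] * w for _ in range(h)]
    hh = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            a, b = img[2 * y][2 * x], img[2 * y][2 * x + 1]
            c, d = img[2 * y + 1][2 * x], img[2 * y + 1][2 * x + 1]
            ll[y][x] = (a + b + c + d) / 4  # local average (low frequency)
            lh[y][x] = (a + b - c - d) / 4  # difference between the two rows
            hl[y][x] = (a - b + c - d) / 4  # difference between the two columns
            hh[y][x] = (a - b - c + d) / 4  # diagonal difference
    return ll, lh, hl, hh


def inverse_haar_level(ll, lh, hl, hh):
    """Exact inverse of haar_level."""
    h, w = len(ll), len(ll[0])
    img = [[0.0] * (2 * w) for _ in range(2 * h)]
    for y in range(h):
        for x in range(w):
            s, v, k, g = ll[y][x], lh[y][x], hl[y][x], hh[y][x]
            img[2 * y][2 * x] = s + v + k + g
            img[2 * y][2 * x + 1] = s + v - k - g
            img[2 * y + 1][2 * x] = s - v + k - g
            img[2 * y + 1][2 * x + 1] = s - v - k + g
    return img


def decompose(img, levels):
    """Multi-scale decomposition: coarse band plus detail subbands, finest first."""
    details, ll = [], img
    for _ in range(levels):
        ll, lh, hl, hh = haar_level(ll)
        details.append((lh, hl, hh))
    return ll, details


def reconstruct(ll, details):
    """Rebuild from the lowest-frequency level up to the highest."""
    for lh, hl, hh in reversed(details):
        ll = inverse_haar_level(ll, lh, hl, hh)
    return ll
```

In a scheme like the one described, the detail subbands would not be transmitted in full; the decoder would estimate them level by level from retrieved patches. Here they are kept so that the round trip is exact.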
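     In the third part, we propose a jointly optimized coding method for image sets, based on analyzing the global similarity of image features. From the overall statistics of the local features of each image, we define a feature distance to measure the global similarity between images. On this basis, we cluster the image set into several subsets with stronger internal correlation and describe the relationships among the images in each subset as a weighted directed graph. Each node of the graph represents an image, and each edge is weighted by the feature distance. Finding the minimum-weight spanning tree of this directed graph yields an optimized coding structure with minimum prediction cost. To further strengthen the correlation between images, we propose a new feature-based three-step inter-image prediction method. First, we use SIFT feature matching and multi-model geometric motion estimation to remove the geometric deformation of different regions. Second, we introduce a photometric transformation to remove differences caused by illumination changes. Finally, we use block-based motion compensation to generate a locally optimized prediction signal. The proposed feature-based method draws on the strengths of several content analysis techniques: feature-based global analysis determines the optimized coding structure; image transformation based on local feature matching strengthens region-level correlation between images; pixel-based motion compensation generates a more precise prediction signal. Our method therefore exploits the visual redundancy among correlated images effectively and improves the efficiency of coding the image set as a whole, while also offering a new direction for further research on large-scale image and video coding in big-data and cloud storage environments.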
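A minimum-weight spanning tree of a directed graph (a minimum spanning arborescence) is commonly computed with the Chu-Liu/Edmonds algorithm. The following is a minimal sketch of that algorithm, not the thesis implementation. It assumes integer node labels, a root image chosen in advance (for instance the image that is coded without prediction), and edge weights that stand in for feature distances; all function names are hypothetical.

```python
def find_cycle(nodes, edges, best, root):
    """Walk backwards along cheapest incoming edges; return a cycle or None."""
    for start in nodes:
        path, v = [], start
        while v != root and v not in path:
            path.append(v)
            v = edges[best[v]][0]
        if v != root:
            return path[path.index(v):]
    return None


def min_arborescence(nodes, edges, root):
    """Chu-Liu/Edmonds. edges: list of (src, dst, weight).
    Returns indices into `edges` that form a minimum-cost tree rooted at `root`."""
    best = {}  # node -> index of its cheapest incoming edge
    for i, (u, v, w) in enumerate(edges):
        if v != root and u != v and (v not in best or w < edges[best[v]][2]):
            best[v] = i
    if any(v != root and v not in best for v in nodes):
        raise ValueError("some image cannot be predicted from the root")
    cycle = find_cycle(nodes, edges, best, root)
    if cycle is None:
        return list(best.values())
    # Contract the cycle into one fresh node and solve the smaller problem.
    cyc, c = set(cycle), max(nodes) + 1
    new_nodes = [v for v in nodes if v not in cyc] + [c]
    new_edges, back = [], []
    for i, (u, v, w) in enumerate(edges):
        if u in cyc and v in cyc:
            continue
        if v in cyc:  # entering the cycle: pay only the extra cost
            new_edges.append((u, c, w - edges[best[v]][2]))
        elif u in cyc:
            new_edges.append((c, v, w))
        else:
            new_edges.append((u, v, w))
        back.append(i)
    chosen = [back[j] for j in min_arborescence(new_nodes, new_edges, root)]
    # The edge that enters the cycle replaces the cycle edge into the same node.
    entered = next(edges[i][1] for i in chosen if edges[i][1] in cyc)
    return chosen + [best[v] for v in cycle if v != entered]


def coding_structure(nodes, edges, root):
    """Return ({image: reference image}, total prediction cost)."""
    chosen = min_arborescence(nodes, edges, root)
    parent = {edges[i][1]: edges[i][0] for i in chosen}
    return parent, sum(edges[i][2] for i in chosen)
```

For example, with four images `[0, 1, 2, 3]`, root `0`, and edges `(0,1,5), (0,2,4), (1,2,1), (2,1,1), (2,3,2), (1,3,3), (0,3,10)`, the cheapest incoming edges of images 1 and 2 form a cycle; after contraction the sketch selects 0→2, 2→1, and 2→3, so image 2 is predicted from the root and images 1 and 3 are predicted from image 2.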
Image and video coding have been studied for more than twenty years and have achieved great success. However, it is increasingly difficult to keep improving coding efficiency in traditional ways. It is necessary to analyze and understand visual content from a different angle. Recent developments in computer vision inspire us to use visual content analysis technologies to exploit visual correlations.
     In this thesis, we mainly study the combination of visual content analysis technologies with image and video coding approaches. These visual content analysis technologies focus on exploiting visual correlations and help us further remove visual redundancy. The main contributions of this thesis fall into three parts.
     In the first part of our work, we propose a novel visual-pattern-based image coding mode. This mode uses pixel-level visual pattern analysis to exploit pixel-level visual correlation. Using the prior knowledge captured by visual patterns, we adaptively discard some high-frequency redundancy on the encoder side and restore it on the decoder side. In this way, both higher compression ratios and better visual quality can be achieved. We also extend the visual-pattern-based approach to the scalable video coding scenario to better balance coding performance and scalability. We propose a novel visual-pattern-based inter-layer prediction approach, which searches and maps among visual patterns to exploit spatial-temporal correlations and produces two enhanced signals that improve the precision of inter-layer prediction. We also adopt a parameter-analysis-based approach (e.g. using the coded base-layer HEVC quadtree information) to achieve highly efficient inter-layer prediction with only a limited increase in decoding complexity. By involving multiple content analysis technologies, our scheme supports both multi-loop and single-loop implementations.
     In the second part of our work, we propose a novel feature-based image compression scheme. This scheme uses robust image feature matching to establish closer local correlations. We combine pixel-based analysis and region-based analysis to further reduce visual redundancy. Specifically, we first decompose an image into global information and local information via a multi-scale wavelet transform and SIFT feature extraction, and then compress them accordingly. The global information is the basic description of an image, with limited redundancy. The local information consists of the SIFT features extracted in the different subbands. On the decoder side, we use the decoded SIFT features to establish visual correlations with images in the cloud and to generate a group of visually similar image patches. Then, we employ the aforementioned visual-pattern-based learning and mapping to merge these patches into the global information and reconstruct the target image. The reconstruction proceeds iteratively from the lowest subband to the highest. Taking advantage of two kinds of analysis (the pixel-level visual pattern and the region-level image feature), our method demonstrates good visual quality at very high compression ratios.
     In the final part of our work, we propose a novel feature-based image set compression scheme. Based on the global statistical characteristics of robust local features, we use a feature distance to analyze visual similarities among images. According to the feature distances, we first divide a large image set into more compact subsets. Then, we model the relationships among the images in a subset as a directed graph. Each node of this graph denotes an image, and the weight of each edge is the feature distance. By searching for the minimum spanning tree, that is, the tree with the smallest total edge cost, we obtain an optimized coding structure for a given image set. To further enhance the visual correlations between images, we propose a feature-based three-step inter-image prediction approach. In the first step, we apply a multi-model estimation algorithm to estimate multiple geometric models for different image regions and reduce the geometric distortions in these regions accordingly. Following this, we apply a photometric transformation to eliminate variations caused by illumination changes. In the final step, we use block-based motion compensation to improve the local prediction precision. Our feature-based scheme takes advantage of multiple content analysis approaches. The feature-based global analysis determines a more efficient coding structure; the feature-based local matching enhances inter-image region-level correlations; the pixel-based compensation produces more precise predictions. Thus, our scheme improves the coding efficiency of image sets and shows potential for future big data analysis and compression in cloud environments.
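The second and third prediction steps can be sketched under simple assumptions. The abstract does not specify the form of the photometric transformation or the block search, so this sketch assumes a global linear gain/offset model fitted by least squares and a full-search block match that minimizes the sum of absolute differences (SAD). Images are lists of rows of grayscale values, the reference is assumed to be geometrically aligned already, and the function names are hypothetical.

```python
def fit_gain_offset(ref, tgt):
    """Least-squares (a, b) such that a * ref + b approximates tgt pixel by pixel."""
    xs = [p for row in ref for p in row]
    ys = [p for row in tgt for p in row]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    var = sum((x - mx) ** 2 for x in xs)
    if var == 0:  # flat reference: only an offset can be estimated
        return 1.0, my - mx
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var
    return a, my - a * mx


def apply_gain_offset(ref, a, b):
    """Photometrically compensated copy of the reference image."""
    return [[a * p + b for p in row] for row in ref]


def best_match(ref, block, top, left, radius):
    """Full-search block matching around (top, left).
    Returns (sad, dy, dx) for the displacement with the smallest SAD,
    or None if no candidate position lies inside the reference."""
    bh, bw = len(block), len(block[0])
    best = None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            y0, x0 = top + dy, left + dx
            if y0 < 0 or x0 < 0 or y0 + bh > len(ref) or x0 + bw > len(ref[0]):
                continue  # candidate block falls outside the reference
            sad = sum(
                abs(ref[y0 + i][x0 + j] - block[i][j])
                for i in range(bh)
                for j in range(bw)
            )
            if best is None or sad < best[0]:
                best = (sad, dy, dx)
    return best
```

A typical use would fit `(a, b)` between the aligned reference and the target, apply it to the reference, and then call `best_match` for each target block against the compensated reference. A real codec would choose displacements by rate-distortion cost rather than SAD alone.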
