Geo-tagging for Large-scale Multimedia Data and Its Applications

Original title: 海量多媒体数据的地理信息标注技术及其应用
Author: Liu Heng (刘衡)
Degree level: Doctoral
Discipline: Signal and Information Processing
Keywords: image retrieval ; geo-tagging ; image clustering ; 3D reconstruction ; codebook learning ; structure propagation ; image completion
Degree year: 2014
Advisor: Li Houqiang (李厚强)
Discipline code: 081002
Degree-granting institution: University of Science and Technology of China
Thesis submission date: 2014-05-01

Abstract

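With the rapid development of computer, communication, and multimedia technologies, people can easily capture multimedia data such as images and videos and share them with other users over the network. The information on the Internet is growing explosively, providing people with abundant information resources, and multimedia data, typified by images and videos, accounts for an ever-larger share of it. How to organize and manage massive multimedia data effectively has become a problem of growing concern to both industry and academia. Automatic geo-tagging of multimedia data lets users discover relevant multimedia data conveniently and quickly, and it also helps with the storage and visualization of such data, so it is of great theoretical significance and practical value. However, geo-tagging massive multimedia data faces several challenges: for images, videos, and other multimedia data, we need not only the geographic location but often also estimates of the camera orientation, the location of the captured scene, and its geometric structure, for use in applications such as virtual navigation. To address the incomplete annotation and limited accuracy of existing multimedia geo-tagging techniques, this thesis proposes a visual localization technique based on matching 2D images to 3D scenes, which yields accurate and complete geo-tagging information for images.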
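     The research in this thesis focuses on vision-based methods for image geo-tagging, covering the estimation of complete geo-tagging information for images, the optimization of geo-tagging techniques, and the application of geo-tagging techniques. The main work and contributions of this thesis can be summarized as follows: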
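     (1) The thesis proposes a technique for accurate image geo-tagging based on matching 2D image features to points of 3D scene models. First, 3D models of individual geographic locations are obtained through image clustering and 3D reconstruction. For an image supplied by the user, large-scale image retrieval matches it to the corresponding images and 3D scene; the 2D image is then registered to the 3D model, yielding complete geographic information for the image, including the camera position, the camera orientation, and the location of the captured scene, with relatively high accuracy. The thesis also discusses in depth the implementation of this system on mobile devices and related mobile applications, including a vision-based localization and automatic navigation application that helps users better understand their surroundings.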
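     (2) The thesis proposes an algorithm that optimizes image geo-tagging to improve its accuracy. First, it proposes a method for generating a visual vocabulary codebook that can discriminate between geographic locations: using the geo-tags already carried by the image database as prior knowledge, it derives the distribution of each visual word in the codebook over geographic locations, which is used to measure how discriminative and descriptive each visual word is with respect to location. By embedding the discriminative and descriptive power of visual words in the codebook, the thesis achieves better location-based image retrieval and localization results. It also analyzes the image scene to extract the scene's geometric structure, which enables more accurate geo-tagging of the image and yields the geometric position of buildings in the image.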
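     (3) The thesis applies geo-tagging to multimedia processing and proposes an algorithm that uses massive Internet data to guide image inpainting. First, large-scale image geo-tagging is used to retrieve reference images that capture the same scene as the target image from a similar viewpoint. Information is then extracted from the reference images and propagated to the target image. The thesis discusses and analyzes in detail several kinds of structural information in images that can guide image inpainting, and designs algorithms for detecting and extracting them from the reference images. Finally, using the extracted structural information as prior knowledge, the thesis implements several structure-guided image inpainting algorithms, which produce results that look good and are consistent with the perceptual characteristics of the human visual system. Unlike earlier algorithms that use only information from the target image itself or rely on manual user interaction, the proposed inpainting algorithm is data-driven.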
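     In summary, this thesis addresses the problem of geo-tagging massive multimedia data on the Internet. It studies how to estimate complete and accurate geographic information for images, optimizes existing geo-tagging techniques to improve system stability and accuracy, and considers and discusses the application of geo-tagging to other areas of multimedia. It considers a series of new problems and proposes a series of new methods, and extensive experiments and application scenarios verify the effectiveness of the proposed methods.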
With the rapid development of computing, electronic communication, and multimedia technologies, people can easily obtain information and share it with other users on the Internet. The explosive growth of information on the Internet brings people abundant information resources. Images and video make up most Internet traffic, so organizing and managing this large volume of multimedia data has become a key problem that has drawn considerable attention from both industry and academia. Geo-tagging, which adds geographical identification metadata to multimedia data, can help users find a wide variety of location-specific information. It also benefits the storage and visualization of these data. However, it is becoming increasingly challenging to manage such an overwhelming amount of multimedia data. Not only the approximate position but also other geographical information, including the camera position, the camera viewing orientation, the scene location, and more specific geometric structure information, is needed for further applications such as virtual navigation. In this thesis, we propose a novel content-based localization approach that aligns the 2D image to 3D scene models to calculate the geographic information.
     In this thesis we focus on techniques for content-based image geo-tagging, including the estimation of comprehensive geographic parameters, the optimization of localization results, and applications in image inpainting with Internet photos. The contributions of this thesis can be summarized as follows.
     Firstly, we propose a novel vision-based localization method that estimates the comprehensive geographic parameters of a given image. 3D scene models are obtained by reconstruction from image clusters. For a given query image, similar images are retrieved and then used to vote for the related 3D scene model. Finally, the 2D image is aligned to the 3D scene model for localization. The estimated geographical parameters include the camera location, the viewing direction, and the scene location. This comprehensive information can be used in mobile applications such as virtual navigation to help users better understand their surroundings.
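     The final alignment step, recovering the camera location and viewing direction from 2D-to-3D matches, can be sketched as follows. The abstract does not say which pose solver is used, so this minimal example assumes a plain Direct Linear Transform (DLT) on at least six noise-free correspondences; the function names are hypothetical, and a practical system would likely add outlier rejection and coordinate normalization.

```python
import numpy as np


def estimate_projection_dlt(points_3d, points_2d):
    """Estimate a 3x4 projection matrix P (up to scale) from 2D-3D matches.

    Assumption: plain DLT, no normalization, no outlier handling.
    """
    points_3d = np.asarray(points_3d, dtype=float)
    points_2d = np.asarray(points_2d, dtype=float)
    if len(points_3d) < 6 or len(points_3d) != len(points_2d):
        raise ValueError("need at least 6 matched 2D-3D points")

    zeros = np.zeros(4)
    rows = []
    for point, (u, v) in zip(points_3d, points_2d):
        xh = np.append(point, 1.0)  # homogeneous 3D point
        # u = (p1 . X) / (p3 . X)  ->  p1 . X - u * (p3 . X) = 0
        rows.append(np.concatenate([xh, zeros, -u * xh]))
        # v = (p2 . X) / (p3 . X)  ->  p2 . X - v * (p3 . X) = 0
        rows.append(np.concatenate([zeros, xh, -v * xh]))

    # The solution is the right singular vector of the smallest singular value.
    _, _, vt = np.linalg.svd(np.vstack(rows))
    return vt[-1].reshape(3, 4)


def camera_center_and_direction(P):
    """Return the camera center and unit viewing direction in world coordinates."""
    M = P[:, :3]
    center = -np.linalg.solve(M, P[:, 3])  # solves M @ C + p4 = 0
    direction = np.linalg.det(M) * M[2]    # principal axis, sign-corrected
    return center, direction / np.linalg.norm(direction)
```

     Given matched 3D model points and 2D feature locations, `camera_center_and_direction(estimate_projection_dlt(points_3d, points_2d))` returns the two quantities the paragraph lists as camera location and viewing direction, expressed in the coordinate frame of the 3D model.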
     Secondly, we propose an optimization method to enhance the accuracy of geo-tagging. We propose a scheme to efficiently generate visual codebooks that discriminate strongly between different locations. Using the geo-tags of the database images as prior knowledge, we calculate the geographic distribution of each visual word to measure its discriminative power. The proposed visual word weighting scheme yields better location recognition performance. Furthermore, we propose to analyze the query image to recover more specific structure of the scene, leading to more precise geo-tagging of the image.
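     The idea of measuring a visual word's discriminative power from its geographic distribution can be illustrated with a small sketch. The abstract does not give the actual weighting formula, so the entropy-based weight below is an assumption chosen for illustration, and the function and parameter names are hypothetical.

```python
import numpy as np


def geo_discriminative_weights(word_ids, location_ids, n_words, n_locations):
    """Weight each visual word by how concentrated it is over locations.

    Assumption: weight = 1 - normalized entropy of the word's location
    distribution, so a word seen at a single location gets weight 1 and a
    word spread evenly over all locations gets weight 0.
    Requires n_locations >= 2.
    """
    word_ids = np.asarray(word_ids, dtype=int)
    location_ids = np.asarray(location_ids, dtype=int)

    # counts[w, l] = occurrences of word w in images geo-tagged with location l
    counts = np.zeros((n_words, n_locations))
    np.add.at(counts, (word_ids, location_ids), 1.0)

    totals = counts.sum(axis=1, keepdims=True)
    probs = np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)

    # Entropy of each word's distribution, treating 0 * log(0) as 0.
    with np.errstate(divide="ignore", invalid="ignore"):
        log_probs = np.where(probs > 0, np.log(probs), 0.0)
    entropy = -(probs * log_probs).sum(axis=1)

    weights = 1.0 - entropy / np.log(n_locations)
    weights[totals[:, 0] == 0] = 0.0  # words never observed carry no weight
    return weights
```

     Such weights could then scale the usual bag-of-visual-words scores during retrieval, so that words concentrated at few locations contribute more to location recognition than words that appear everywhere.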
     Thirdly, we explore the application of geo-tagging in image processing. We present an image completion method that replaces a specified region of a photograph using reference photographs from the Internet. We use image geo-tagging to search the Internet for candidate images that capture the same scene or building. Then we establish geometric relationships between the candidate images and the query image. These relationships are represented by homography transformations estimated from viewpoint-invariant local feature matches. Given these transformations, we can project the structure information from the candidate images onto the target image. The extracted structure information includes line structures and region segmentation information, which are very helpful for image completion. Finally, we use this structure information for image inpainting to obtain fine-grained image completion results.
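     The projection of line structures through a homography can be sketched as below. This is a minimal illustration, assuming clean point matches and a plain DLT estimate; the abstract does not state how the homographies are fitted, a real pipeline would typically add robust estimation against mismatches, and all function names are hypothetical.

```python
import numpy as np


def estimate_homography(src_pts, dst_pts):
    """Estimate H (3x3) with dst ~ H @ src from at least 4 point matches.

    Assumptions: plain DLT, no outlier rejection, and H[2, 2] != 0.
    """
    src = np.asarray(src_pts, dtype=float)
    dst = np.asarray(dst_pts, dtype=float)
    if len(src) < 4 or len(src) != len(dst):
        raise ValueError("need at least 4 matched points")

    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1.0, 0.0, 0.0, 0.0, u * x, u * y, u])
        rows.append([0.0, 0.0, 0.0, -x, -y, -1.0, v * x, v * y, v])

    _, _, vt = np.linalg.svd(np.array(rows))
    H = vt[-1].reshape(3, 3)
    return H / H[2, 2]


def project_points(H, pts):
    """Map 2D points through H using homogeneous coordinates."""
    pts = np.asarray(pts, dtype=float)
    homog = np.hstack([pts, np.ones((len(pts), 1))])
    mapped = homog @ H.T
    return mapped[:, :2] / mapped[:, 2:3]


def transfer_line_segments(H, segments):
    """Project line segments, shape (N, 2, 2), from reference to target image.

    A homography maps straight lines to straight lines, so projecting the
    two endpoints of each segment is sufficient.
    """
    segments = np.asarray(segments, dtype=float)
    return project_points(H, segments.reshape(-1, 2)).reshape(segments.shape)
```

     With `H` estimated from feature matches between a reference image and the target image, `transfer_line_segments(H, segments)` gives the reference line structures in target-image coordinates, where they could serve as guidance inside the region to be completed.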
     In a nutshell, this thesis explores and discusses techniques for geo-tagging large amounts of multimedia data on the Internet from novel and distinctive perspectives and proposes several applications based on geo-tagging. Comprehensive experiments demonstrate the effectiveness and efficiency of the proposed algorithms.

