High-Performance Image Representation Based on Large-Scale Visual Pattern Learning
Abstract
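With the spread of digital media devices and smartphones, and the popularity of social networks and online sharing, the volume of image data on the web keeps growing, and so does the demand for recognizing it. Large-scale image data brings more challenges to recognition, classification, retrieval, and related problems in image recognition, and it also brings more opportunities.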
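     Over the past few years, object retrieval has been a popular problem in large-scale image retrieval. The sparse image representation produced by a large vocabulary is what makes fast querying possible, and a high-performance image representation is what makes retrieval accurate. By studying visual pattern learning and image representation in the local feature space, this thesis generates high-performance image representations quickly and thereby improves the performance of large-scale image retrieval systems.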
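     To address recognition on large-scale image collections, visual attributes and mid-level image representations have become active research topics in recent years. By studying how visual attributes are learned and how mid-level image representations are generated, this thesis learns large numbers of visual attributes quickly and produces high-performance representations usable for both recognition and retrieval.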
     The main contributions and novel aspects of this thesis are as follows:
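     (1) A fast algorithm for constructing high-performance large-scale visual vocabularies. The algorithm targets the performance bottleneck of current large-scale image retrieval systems. Such systems rely on large visual vocabularies to produce sparse representations, which in turn enable fast and accurate search. The best current approximate algorithms cannot deliver both speed and quality when building a large vocabulary. Exploiting the inheritance relationship between visual patterns across iterations of the approximate algorithm, this thesis proposes a robust approximate algorithm with guaranteed fast convergence. The algorithm adds essentially no time or memory cost. Theoretical analysis shows that it converges, in a finite number of rounds, to the converged solution of the exact algorithm. Experiments show that producing a vocabulary of equal quality takes 1/10 of the time required by the best existing algorithm. With this algorithm, a large-scale image retrieval system can quickly build even larger high-performance vocabularies, which supports both its speed and its accuracy. The algorithm can also be applied to other visual pattern discovery tasks to build large sets of visual patterns quickly.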
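     (2) An algorithm for generating high-performance image representations from a given large-scale visual vocabulary. For the problem of generating image representations once a large vocabulary is given, this thesis proposes a high-performance, parameter-robust algorithm that quantizes local features and produces the image representation. It analyzes how multiple assignment improves sparse representations in large-scale image retrieval, tests different pooling methods in the pooling stage for large-scale retrieval, and compares existing quantization algorithms used in retrieval and in recognition. Starting from the scale selectivity of the Gaussian kernel function, it proposes an algorithm that minimizes the reconstruction error in the kernel feature space. The algorithm has clear logic, a simple objective, and a simple solution, and it yields better representations in practice. It makes better use of additional nearest neighbors to produce high-performance sparse image representations; the learned multiple-assignment weights make better use of the local information in the distances, so the resulting representation is more robust to changes in the number-of-neighbors parameter.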
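     (3) A method for quickly generating high-performance linear representations. For general image representation, this thesis starts from linear mid-level representations and proposes a method that indirectly and quickly learns a large number of latent visual attributes and produces a high-performance representation. Most current work on attribute-based mid-level representations directly concatenates the output values of the attribute models into one long vector. In this form, the mid-level representation is a linear map of the model outputs, and the representation is invariant under linear transformations. Taking this as the starting point, the thesis proposes learning visual attributes indirectly by learning such a semantic subspace. A subspace learning algorithm can quickly learn a semantic subspace containing a large number of latent visual attributes. Such a subspace yields high-performance mid-level representations of variable dimension through linear mapping, and its projections are strongly semantic, so they can be given names with the help of manual annotation.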
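     (4) A scheme for generating high-performance nonlinear representations. In general image representation, no representation in linear form can fully exploit the information in the attribute models. Inspired by research on nonlinear representations in other problems, this thesis proposes an attribute-based nonlinear mid-level representation scheme for producing high-performance mid-level representations. The scheme places requirements on three stages, namely visual attribute definition, attribute model learning, and representation generation: define highly biased binary classification problems, learn locally effective support vector machine models, and finally generate the mid-level representation through a nonlinear mapping with an appropriate scale parameter. The nonlinear representation makes better use of the offset and scale information of the attribute models and therefore performs better; the locally effective attribute models indicate that the output values contain some redundant information, which makes subsequent compression possible; and the highly biased binary classification problems make it easy to define large numbers of visual attributes, each acting only on a local region of the feature space, which provides a basis for sparse representations. Experiments confirm that the nonlinear representation significantly improves representation performance.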
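     To make the difference between the linear and nonlinear forms in contribution (4) concrete, the following minimal sketch maps the decision values of linear attribute models through a squashing function with a scale parameter. The abstract does not name the nonlinear mapping, so the sigmoid used here, the function names, and the toy weights are all assumptions for illustration only. In the linear form, the offsets of the attribute models are a constant shift that a downstream linear classifier can absorb into its own bias; after the nonlinearity, both offset and scale change the shape of the representation.

```python
import numpy as np

def attribute_scores(X, W, b):
    """Decision values f_j(x) = w_j . x + b_j for m linear attribute models.

    X: (n, d) features, W: (m, d) model weights, b: (m,) model offsets.
    """
    return X @ W.T + b

def linear_representation(X, W, b):
    """Mid-level representation formed by concatenating raw model outputs."""
    return attribute_scores(X, W, b)

def nonlinear_representation(X, W, b, scale=1.0):
    """Assumed nonlinear form: sigmoid of the scaled decision values.

    Uses the identity sigmoid(z) = 0.5 * (1 + tanh(z / 2)), which is
    numerically stable for large |z|.
    """
    z = attribute_scores(X, W, b) / scale
    return 0.5 * (1.0 + np.tanh(0.5 * z))

# Toy example with two hypothetical attribute models on 2-D features.
W = np.array([[1.0, 0.0], [0.0, 2.0]])
b = np.array([0.0, -2.0])
X = np.array([[0.0, 1.0], [3.0, 4.0]])
rep = nonlinear_representation(X, W, b, scale=2.0)
```

     A sample lying exactly on an attribute's decision boundary maps to 0.5, and samples far from the boundary saturate toward 0 or 1, so the scale parameter controls how local each attribute's response is.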
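     Through the first two contributions, this thesis provides a complete scheme for quickly building high-performance sparse representations. It offers an effective improvement on the current system bottleneck in large-scale image retrieval and ensures that such systems can quickly produce higher-performance, high-dimensional sparse representations.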
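     For readers unfamiliar with the pipeline these two contributions improve, here is a minimal sketch of the standard vocabulary-based retrieval flow: build a vocabulary, quantize local features to visual words, form a sparse bag-of-words vector, and search through an inverted index. It uses plain Lloyd k-means with exact nearest-neighbor search, which is not the approximate algorithm proposed in the thesis, and hard assignment, which is not the thesis's quantization method. All function names and the toy data are hypothetical.

```python
import numpy as np
from collections import defaultdict

def nearest_word(features, words):
    """Index of the closest visual word for each local feature (exact search)."""
    d2 = ((features[:, None, :] - words[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def kmeans_vocabulary(features, k, iters=10, seed=0):
    """Plain Lloyd k-means; returns a (k, d) array of visual words."""
    rng = np.random.default_rng(seed)
    words = features[rng.choice(len(features), size=k, replace=False)].copy()
    for _ in range(iters):
        assign = nearest_word(features, words)
        for j in range(k):
            members = features[assign == j]
            if len(members):  # keep the old center if a cell is empty
                words[j] = members.mean(axis=0)
    return words

def sparse_bow(features, words):
    """Sparse bag-of-words as {word_id: L1-normalized count}."""
    ids, counts = np.unique(nearest_word(features, words), return_counts=True)
    total = counts.sum()
    return {int(i): float(c) / total for i, c in zip(ids, counts)}

def build_inverted_index(bows):
    """Map each word id to the list of (image_id, weight) that contain it."""
    index = defaultdict(list)
    for image_id, bow in enumerate(bows):
        for word_id, weight in bow.items():
            index[word_id].append((image_id, weight))
    return index

def query(index, bow, n_images):
    """Dot-product score per image, touching only words present in the query."""
    scores = np.zeros(n_images)
    for word_id, w in bow.items():
        for image_id, weight in index.get(word_id, []):
            scores[image_id] += w * weight
    return scores

# Toy data: two "images" whose local features fall in different regions.
rng = np.random.default_rng(1)
img0 = rng.normal(0.0, 0.1, size=(20, 2))
img1 = rng.normal(10.0, 0.1, size=(20, 2))
words = kmeans_vocabulary(np.vstack([img0, img1]), k=2)
bows = [sparse_bow(img0, words), sparse_bow(img1, words)]
index = build_inverted_index(bows)
scores = query(index, bows[0], n_images=2)
```

     Because each image touches only a few words of a large vocabulary, the query visits short posting lists rather than every image, which is why vocabulary size and quantization quality dominate both speed and accuracy.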
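     The last two contributions study visual attributes and general mid-level image representation systematically, from the perspectives of linear and nonlinear representation. The proposed method for quickly generating linear representations and the proposed scheme for high-performance nonlinear representations provide a basis for later research on visual attributes and high-performance mid-level representations. In particular, the nonlinear mid-level representation presented last readily yields sparse representations and has the potential to be applied in large-scale image retrieval systems to the problem of retrieving objects of the same category.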
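     This research shows that starting from the visual space, studying the characteristics of its visual patterns, and learning concrete models produces better image representations and provides a basis for better understanding image content. Image representation is the bridge between an image's visual appearance and its semantic content; only a high-performance image representation gives a basis for high-performance recognition and retrieval results, which in turn, together with improvements to other parts of the system, advance the research field as a whole.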
With the spread of digital devices and smartphones, and the popularity of social networks and online photo sharing, the number of web images keeps growing, and so does the demand for associated applications. Large-scale image data and its applications pose a great challenge, and also an opportunity, for research topics in image recognition such as object detection, image classification, and image retrieval.
     In the past few years, object retrieval has been a hot topic in image retrieval. The sparse image representation generated by a large vocabulary enables fast search in image retrieval. By studying visual pattern learning in the local feature space and image representation, we generate high-performance image representations rapidly and so contribute to a better image retrieval system.
     To perform recognition on large-scale image collections, visual attribute learning and mid-level image representation have become hot research topics in recent years. We studied the learning of visual attributes and the generation of mid-level representations, in order to learn large-scale attributes rapidly and generate high-performance mid-level representations for recognition and retrieval.
     Our contributions and novelty are summarized as follows.
     (1) To address the bottleneck of current large-scale image retrieval systems, we proposed an algorithm for the fast construction of high-performance visual vocabularies. A large-scale image retrieval system depends on a large-scale vocabulary to generate sparse representations, indexed by an inverted table, for fast and accurate search. Using the inheritance of visual patterns across the iterations of the approximate algorithm, we proposed a robust approximate algorithm with guaranteed rapid convergence. The proposed algorithm requires almost no additional time or memory. Theoretical proofs guarantee that the algorithm converges to the converged solution of the exact algorithm. The experimental results show that our algorithm is about 10 times faster than the state-of-the-art algorithm at generating equivalent vocabularies. With it, a large-scale image retrieval system can easily generate an even larger high-performance vocabulary, which provides effective technical support for the search speed and performance of the retrieval system. The proposed algorithm can also be used in other visual pattern discovery tasks to construct a set of visual patterns rapidly.
     (2) For the generation of image representations in large-scale image retrieval systems, we proposed a high-performance, parameter-insensitive algorithm for quantizing local features and generating the image representation. Based on the locality of the Gaussian kernel function, we proposed an algorithm that minimizes the kernel reconstruction error. The proposed algorithm makes better use of more neighbors to generate high-performance sparse image representations; the learnt quantization weights extract more information from the distances, so the image representation is less sensitive to the number-of-neighbors parameter.
     (3) For the representation of general images, we proposed an indirect method, motivated by linear representation, to learn large-scale latent visual attributes rapidly and generate high-performance image representations. In attribute-based mid-level representation, most existing works concatenate the outputs of attribute models into a long vector as the representation. We proposed to learn visual attributes indirectly by learning one semantic subspace. The subspace learning algorithm can rapidly learn large-scale latent visual attributes into the semantic subspace. The semantic subspace is rich in semantic concepts, so the linear representation generated by linear projections achieves high performance. In addition, the linear projections are semantic-aware and can be manually labeled with descriptions.
     (4) For the representation of general images, we proposed a nonlinear representation based on visual attributes for high-performance representation. All representations in linear form share the shortcoming that they cannot use all the information in the attribute models. The proposed representation scheme is motivated by nonlinear representations in other problems. The scheme sets requirements for 3 procedures, the attribute definition, the attribute model learning, and the representation generation: each attribute is defined as a highly biased binary classification problem; the attribute model is a support vector machine; and the representation is generated by a nonlinear mapping with a proper scale value as the parameter. The experiments show that the nonlinear representation improves the representation significantly.
     With the first 2 works, we proposed a scheme to generate high-performance sparse representations, which guarantees that a large-scale image retrieval system can generate high-dimensional sparse representations rapidly.
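     As an illustration of the idea behind contribution (2), the sketch below computes multiple-assignment weights for one local feature over its nearest visual words in two ways: the common baseline that sets each weight proportional to the Gaussian kernel value, and weights that minimize the reconstruction error in the kernel feature space. The abstract does not state the exact objective or its constraints, so the unconstrained least-squares form used here, the small ridge term, and all names are assumptions and not necessarily the thesis's formulation. For a Gaussian kernel, k(x, x) = 1, so the squared error is 1 - 2 w·k_x + wᵀKw, minimized by solving K w = k_x.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """Gaussian kernel matrix between rows of A (n, d) and rows of B (m, d)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def kernel_weights(x, neighbors, sigma):
    """Baseline soft assignment: weights proportional to k(x, c_i)."""
    k_x = gaussian_kernel(x[None, :], neighbors, sigma)[0]
    return k_x / k_x.sum()

def reconstruction_weights(x, neighbors, sigma, ridge=1e-8):
    """Assumed objective: minimize ||phi(x) - sum_i w_i phi(c_i)||^2.

    The minimizer solves K w = k_x; a tiny ridge keeps the solve stable.
    """
    K = gaussian_kernel(neighbors, neighbors, sigma)
    k_x = gaussian_kernel(x[None, :], neighbors, sigma)[0]
    return np.linalg.solve(K + ridge * np.eye(len(neighbors)), k_x)

def reconstruction_error(x, neighbors, sigma, w):
    """Squared kernel-space error, using k(x, x) = 1 for the Gaussian kernel."""
    K = gaussian_kernel(neighbors, neighbors, sigma)
    k_x = gaussian_kernel(x[None, :], neighbors, sigma)[0]
    return 1.0 - 2.0 * (w @ k_x) + w @ K @ w

# Toy example: one 2-D feature and its three nearest visual words.
neighbors = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
x = np.array([0.2, 0.1])
w_soft = kernel_weights(x, neighbors, sigma=1.0)
w_rec = reconstruction_weights(x, neighbors, sigma=1.0)
err_soft = reconstruction_error(x, neighbors, 1.0, w_soft)
err_rec = reconstruction_error(x, neighbors, 1.0, w_rec)
```

     In this form the weights account for how the neighboring words relate to one another through K, and not only for each word's distance to the feature, which is one way extra neighbors can add information without simply diluting the assignment.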
     The last 2 works study visual attributes and mid-level representation from the viewpoints of both linear and nonlinear representation. The proposed method for fast learning of linear representations and the proposed scheme for generating high-performance nonlinear representations are helpful for future work on visual attributes and high-performance mid-level representation.
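     The linear-invariance argument behind contribution (3) can be checked numerically. In the sketch below, a linear predictor trained on the outputs of a set of linear attribute models gives the same predictions when the attributes are re-mixed by any invertible matrix, or replaced by an orthonormal basis of the subspace they span. This is why learning the subspace is enough for a linear representation. The attribute weights here are random placeholders, and the QR step only re-expresses a given subspace; it is not the subspace learning algorithm of the thesis.

```python
import numpy as np

def fit_predict(Z_train, y_train, Z_test):
    """Least-squares linear predictor with a bias term."""
    def add_bias(Z):
        return np.hstack([Z, np.ones((len(Z), 1))])
    coef, *_ = np.linalg.lstsq(add_bias(Z_train), y_train, rcond=None)
    return add_bias(Z_test) @ coef

rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 8))   # 8-D low-level features
X_test = rng.normal(size=(10, 8))
y_train = rng.normal(size=50)        # placeholder regression targets

W = rng.normal(size=(3, 8))          # 3 hypothetical linear attribute models
A = np.array([[2.0, 1.0, 0.0],       # invertible re-mixing (determinant 7)
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 3.0]])
basis, _ = np.linalg.qr(W.T)         # (8, 3) orthonormal basis of span(W)

pred_attr = fit_predict(X_train @ W.T, y_train, X_test @ W.T)
pred_mixed = fit_predict(X_train @ (A @ W).T, y_train, X_test @ (A @ W).T)
pred_basis = fit_predict(X_train @ basis, y_train, X_test @ basis)
```

     All three predictions agree up to floating-point error, so for linear representations the individual attribute models matter only through the subspace they span.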
