视觉对象分类：多核多示例学习

English title: Visual Object Classification: Multiple Kernel Multiple Instance Learning
Author: 王孟月
Degree level: Master's
Discipline: Signal and Information Processing
Chinese keywords: 视觉对象分类 ; 图像分类 ; 视觉短语 ; 多示例学习 ; 多核学习 ; 多核多示例学习
English keywords: Visual Object Classification ; Image Classification ; Visual Phrase ; Multiple Instance Learning ; Multiple Kernel Learning ; Multiple Kernel Multiple Instance Learning
Degree year: 2011
Advisors: 陈卫东 ; 宋彦
Discipline code: 081002
Degree-granting institution: University of Science and Technology of China
Thesis submission date: 2011-05-06

Abstract

Visual object classification automatically assigns a set of images to object categories, or decides whether a given image belongs to a particular category, and locates and extracts the objects of interest in the image. It is an active and difficult problem in computer vision and pattern recognition, and it matters for image content understanding and image retrieval. Real-world images vary widely in viewpoint, brightness, and scale, and the volume of image data grows daily, so traditional manual extraction of visual objects has become very difficult. Machine learning methods are therefore needed to classify and learn semantic concepts from the low-level visual features of images and to build complex visual object classification models. Image content is currently represented mostly by low-level visual features such as color, texture, shape, and the spatial relationships among objects, but a large "semantic gap" separates the visual features a computer can express from the actual semantics of an image.
     This thesis studies visual object classification. It addresses two shortcomings: the time-consuming manual labeling that traditional machine learning methods require, and the limited semantic descriptive power of the "Bag of Words" image representation. To this end it improves existing multiple instance learning algorithms. The main research contents are as follows.
     1. Multiple instance learning combined with segmentation. Building on the MILES algorithm, this method combines multiple instance learning with image segmentation to detect and extract objects. Under the "Bag of Words" image representation, each image is treated as a bag, the visual words that represent the image are the instances in the bag, and the visual vocabulary serves as the feature space; each bag is mapped into this space by counting its instances per visual word. Because the 1-norm SVM yields sparse solutions, it is then used to select important features and classify images at the same time. To extract the object, the instances of each image classified as positive are classified in turn, the locations of the positive instances are taken as object "seed" points, and these seeds are combined with the image segmentation result to extract the object. Experiments on the Caltech 101 benchmark dataset demonstrate the effectiveness of the algorithm.
     2. Multiple instance learning based on visual phrases. In the "Bag of Words" representation, visual words are produced by unsupervised clustering alone, which ignores the spatial relationships among visual words and leaves the representation with limited semantic descriptive power and weak discriminative power. This thesis replaces visual words with a higher-order visual feature, the visual phrase, which is built from over-segmented image regions of homogeneous appearance and the visual words within each region, so that spatial coherence adds semantic discriminative power. Traditional "Bag of Words" classifiers are also sensitive to background clutter, occlusion, and large scale changes, which lowers classification accuracy. Combining visual phrases with multiple instance learning, the thesis proposes MVPL (Multiple Visual Phrase Learning), an image classification method that learns from weakly labeled image data and whose final model reflects the regional characteristics of each image category. Experiments on the Caltech 101 and Scene 15 benchmark datasets show relative improvements in classification accuracy of about 9% and 7%, respectively, over existing algorithms.
     3. Multiple kernel multiple instance learning. A visual object usually needs several kinds of features to describe it, and classification based on a single feature can be inaccurate. Multiple instance learning handles weakly labeled images and achieves high classification accuracy, but it usually describes each instance with only one feature. The thesis therefore uses multiple kernels to bring multiple features into multiple instance learning and proposes a framework, Multiple Instance Multiple Kernel Learning (MIMKL), for multi-feature learning in the multiple instance setting. Built on multiple instance learning, the framework describes each instance with several features and learns the weight of each feature during training. It combines the strengths of the different features and achieves high classification accuracy. Experiments on the Caltech 101 benchmark dataset show that the framework classifies well.
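
     Two steps named in the abstract can be illustrated with a minimal sketch: mapping a bag of visual words to a histogram over the vocabulary (item 1), and combining per-feature kernels with weights (item 3). This is not the thesis's implementation. The function names, the toy data, the linear base kernel, and the second "color" feature are all assumptions made for illustration, and the kernel weights are fixed by hand here, whereas the thesis learns them during training.

```python
import numpy as np


def bag_histogram(word_ids, vocab_size):
    """Map one bag (image) to a count histogram over the visual vocabulary.

    word_ids: visual-word index of each instance in the bag (hypothetical input).
    """
    ids = np.asarray(word_ids, dtype=int)
    return np.bincount(ids, minlength=vocab_size).astype(float)


def linear_kernel(X):
    """Gram matrix of a linear kernel; an assumed base kernel, one per feature."""
    return X @ X.T


def combine_kernels(kernels, weights):
    """Weighted sum of base kernel matrices.

    Weights are assumed non-negative and summing to 1, a common multiple
    kernel learning convention; learning them is outside this sketch.
    """
    weights = np.asarray(weights, dtype=float)
    if np.any(weights < 0) or not np.isclose(weights.sum(), 1.0):
        raise ValueError("weights must be non-negative and sum to 1")
    return sum(w * K for w, K in zip(weights, kernels))


# Toy data: two images, vocabulary of 5 visual words.
bags = [[0, 2, 2, 3], [1, 1, 4]]
H = np.vstack([bag_histogram(b, vocab_size=5) for b in bags])

# A second, hypothetical per-image feature (for example, a color descriptor).
C = np.array([[1.0, 0.0], [0.5, 0.5]])

K = combine_kernels([linear_kernel(H), linear_kernel(C)], weights=[0.75, 0.25])
print(H)
print(K)
```

     The combined matrix `K` could then be passed to any kernel classifier that accepts a precomputed Gram matrix.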

