Key Technologies in Object Categorization Based on Computer Vision

Original title (Chinese): 基于计算机视觉的物体分类关键技术研究
Author: 朱荣 (Zhu Rong)
Thesis level: Doctoral
Discipline: Communication and Information Systems
Keywords: Object Recognition ; Semantic Feature ; Bag of Words ; k-Means Clustering ; Kernel Function ; Support Vector Machine ; Scale Component
Degree year: 2011
Advisor: 胡瑞敏 (Hu Ruimin)
Discipline code: 081001
Degree-granting institution: 武汉大学 (Wuhan University)
Submission date: 2011-10-30

Abstract

Object recognition is an active research direction in computer vision, both in China and abroad. Its essence is to build a computational system that can identify the categories of the objects of interest in an image. It has broad application demand in real life and considerable practical and research value. In recent years, as pattern classification techniques have matured and artificial intelligence has continued to develop, object recognition based on semantic feature extraction has gained increasing acceptance among researchers. Semantic features are obtained by extracting the local features of a class of objects and then converting them, according to given processing rules, into semantic information that describes the class; the resulting semantic feature model of the class supports feasible and effective object classification and recognition.
     Object images carry a large amount of information and are computationally expensive to process, so using effective object features for automatic classification and recognition remains a major challenge for current algorithms in practical applications. This thesis first summarizes the state of the art in object recognition and its open problems, and introduces the basic algorithmic framework for object recognition and classification together with a comparison of visual invariant features. It then studies in depth the information carried by the scale component of SIFT and designs a two-level matching hierarchical clustering algorithm that effectively improves the rate of correct matches. On this basis, it studies the SIFT-based Bag of Words framework and uses a support vector machine to select the feature points that form the visual vocabulary; experiments show better performance than the k-means clustering algorithm. Finally, it describes the experimental procedure of the system in full and analyzes the experimental results, which demonstrate the effectiveness of the proposed algorithms.
     The main research content and innovations of this thesis in visual object classification and recognition are as follows:
     (1) A two-level SIFT feature matching algorithm based on the scale component
     In object classification, SIFT features are invariant in scale space. In typical applications, a full search is performed directly over the entire sample space, and a match is accepted or rejected by thresholding the ratio of the nearest-neighbor distance to the second-nearest-neighbor distance. This approach has two problems: it produces mismatches, and it cannot cope with self-similar feature points within an object. This thesis analyzes the scale relationship between matched feature points of same-class objects under different camera parameters, computes the relative scale of the observed object, and designs a two-level matching method that brings the scale component into the decision process, improving both matching accuracy and efficiency.
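     The abstract does not spell out the two-level procedure, so the following is only a minimal sketch of one plausible reading. Level 1 is the standard nearest/second-nearest ratio test; level 2 estimates the relative object scale as the median scale ratio of the level-1 matches and discards matches that disagree with it. The function names, the `ratio` and `tol` parameters, and the use of the median are all assumptions made for illustration, not the author's algorithm.

```python
import math
from statistics import median

# Each keypoint is a (descriptor, scale) tuple; the descriptor is a list of floats.
# All names and thresholds below are illustrative assumptions.

def ratio_test_matches(query, train, ratio=0.8):
    """Level 1: accept a match when the nearest distance is clearly
    smaller than the second-nearest distance (ratio test)."""
    matches = []
    for qi, (q_desc, _) in enumerate(query):
        dists = sorted(
            (math.dist(q_desc, t_desc), ti) for ti, (t_desc, _) in enumerate(train)
        )
        if len(dists) >= 2 and dists[0][0] < ratio * dists[1][0]:
            matches.append((qi, dists[0][1]))
    return matches

def scale_consistent_matches(query, train, matches, tol=0.25):
    """Level 2: estimate the relative scale as the median log scale ratio
    and keep only matches whose own ratio lies within `tol` of it."""
    if not matches:
        return [], None
    log_ratios = [math.log(query[qi][1] / train[ti][1]) for qi, ti in matches]
    center = median(log_ratios)
    kept = [m for m, r in zip(matches, log_ratios) if abs(r - center) <= tol]
    return kept, math.exp(center)

# Toy data: the query object appears at twice the training scale,
# except for the last keypoint, whose scale is inconsistent.
train = [([0.0, 0.0], 1.0), ([10.0, 0.0], 2.0), ([0.0, 10.0], 1.5),
         ([10.0, 10.0], 3.0), ([20.0, 20.0], 1.0)]
query = [([0.1, 0.0], 2.0), ([10.0, 0.1], 4.0), ([0.0, 10.1], 3.0),
         ([20.1, 20.0], 0.5)]

level1 = ratio_test_matches(query, train)
level2, relative_scale = scale_consistent_matches(query, train, level1)
print(level1)   # all four keypoints pass the ratio test
print(level2)   # the scale-inconsistent match (3, 4) is dropped
```

     In this toy case the descriptor distances alone cannot reject the last match; only the scale component does, which is the kind of extra evidence the paragraph above describes.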
     (2) An SVM-based method for generating the visual vocabulary
     Existing Bag of Words algorithms take the cluster centers of the descriptors as visual words, but this causes serious semantic loss. This thesis proposes two visual word generation methods based on a decision mechanism: the decision mechanism selects a number of effective within-class feature points to replace the cluster centers, forming semantically rich visual words, enriching the semantic information in the visual vocabulary, and improving the recall of feature points during object recognition. A nonlinear SVM classifier, chosen as the best suited to high-dimensional data, performs both the conversion of feature descriptors into visual words and the assignment of the descriptors of a test object, improving the effectiveness of the semantic feature representation and the efficiency of object recognition.
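     For orientation, here is a minimal sketch of the baseline that this contribution sets out to replace: a Bag of Words vocabulary whose visual words are k-means cluster centers, with each descriptor assigned to its nearest center. It does not implement the SVM-based selection, whose details the abstract does not give. The function names and the toy two-dimensional descriptors are assumptions for illustration; real SIFT descriptors have 128 dimensions.

```python
import math
from collections import Counter

def kmeans(points, initial_centers, iterations=10):
    """Plain Lloyd's k-means starting from the given initial centers."""
    centers = [list(c) for c in initial_centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            clusters[nearest].append(p)
        for k, members in enumerate(clusters):
            if members:  # keep the previous center if a cluster becomes empty
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return centers

def bow_histogram(descriptors, vocabulary):
    """Normalized histogram of nearest-visual-word assignments."""
    counts = Counter(
        min(range(len(vocabulary)), key=lambda k: math.dist(d, vocabulary[k]))
        for d in descriptors
    )
    total = len(descriptors) or 1
    return [counts[k] / total for k in range(len(vocabulary))]

# Toy training descriptors forming two groups.
training = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]]
vocabulary = kmeans(training, initial_centers=[[0.0, 0.0], [10.0, 10.0]])
print(vocabulary)

# Describe a new image by how often each visual word occurs in it.
image_descriptors = [[0.0, 1.0], [1.0, 1.0], [9.0, 11.0]]
print(bow_histogram(image_descriptors, vocabulary))
```

     Note that the resulting centers are averages and need not coincide with any observed descriptor. The thesis instead builds visual words from selected real feature points.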
     (3) A visual object classification method based on a set of small vocabularies
     In object classification, visual Bag of Words methods are generally based on a single unified large vocabulary; a typical example is the histogram-based Bayesian posterior probability classifier. For applications with only a small number of categories to be recognized, this thesis proposes a classification method that uses one vocabulary per object class. Each per-class vocabulary is markedly smaller than the unified large vocabulary, and system robustness improves noticeably.
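     The one-vocabulary-per-class idea can be sketched as follows. The abstract does not state the decision rule, so this example assumes a simple one: assign the test image to the class whose small vocabulary quantizes its descriptors with the lowest mean distance. The function names, the class labels, and the toy vocabularies are hypothetical.

```python
import math

def mean_quantization_error(descriptors, vocabulary):
    """Average distance from each descriptor to its nearest visual word."""
    return sum(
        min(math.dist(d, word) for word in vocabulary) for d in descriptors
    ) / len(descriptors)

def classify(descriptors, class_vocabularies):
    """Pick the class whose own vocabulary fits the descriptors best.
    This decision rule is an assumption made for illustration."""
    return min(
        class_vocabularies,
        key=lambda name: mean_quantization_error(descriptors, class_vocabularies[name]),
    )

# One small vocabulary per class instead of a single unified one.
class_vocabularies = {
    "car": [[0.0, 0.0], [1.0, 1.0]],
    "face": [[10.0, 10.0], [11.0, 11.0]],
}

test_image = [[0.2, 0.1], [0.9, 1.2]]
print(classify(test_image, class_vocabularies))
```

     With this layout, adding a category means building one more small vocabulary rather than re-clustering a unified one, which suits the few-category setting the paragraph describes.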
