Research on Image Scene Classification Based on Mid-Level Semantic Representation
Abstract
With the development of multimedia and computer network technology, the volume of image data people encounter is growing at an unprecedented rate. Faced with such massive image resources, how to effectively analyze, organize and manage image data and realize content-based image retrieval has become a hot topic in multimedia research. The task of scene classification arose against this background. Scene classification automatically annotates an image database according to a given set of semantic categories, providing effective contextual semantic information to guide higher-level image understanding such as object recognition. The difficulty lies in enabling the computer to understand the scene semantics of an image from the perspective of human cognition, and to discriminate effectively between intra-class diversity and inter-class similarity of scenes. Building on mid-level semantic representations of scenes, this thesis focuses on how to extract effective visual features from scene images and bridge the semantic gap between low-level features and high-level semantics. Around this problem, the thesis makes the following contributions:
     We propose a scene classification algorithm that builds class-specific visual dictionaries, using mutual information as the feature-selection criterion. According to each visual word's contribution to a given class, the words contributing most to that class are selected from the universal visual dictionary to form the class-specific dictionary, from which a class-specific histogram is generated. The final fused histogram is produced by adaptively weighting and merging the universal histogram (based on the universal dictionary) and the class-specific histogram; this weighting lets the two histograms describe the image in a competitive manner. The fused histogram retains the discriminative power of the universal histogram while, through the class-specific histogram, strengthening the discrimination between similar scenes of different classes, thereby alleviating inter-class similarity and improving classification accuracy.
     We propose a multi-scale, multi-level scene classification model based on different feature granularities (Multi-Scale Multi-Level pLSA, MSML-pLSA). The model consists of two parts: the multi-scale part extracts visual details from scene images at different scales and builds a multi-scale histogram; the multi-level part linearly concatenates the scene representations corresponding to different numbers of semantic topics into the final scene representation, the multi-scale multi-level histogram. The MSML-pLSA model integrates visual and semantic information of different granularities within a unified framework, yielding a more complete scene description.
     We propose a scene classification algorithm that extracts contextual information by unsupervised learning, extending local visual words to contextual visual words. A contextual visual word encodes not only the local visual information of a given region of interest (ROI) at the current scale, but also the information contained in the ROI's neighboring regions and in the co-centered region at the adjacent coarser scale. By introducing the ROI's context, contextual visual words describe the semantics of the image scene more effectively, reducing semantic ambiguity and hence the scene classification error rate.
     We study the influence of the number of feature points on classification accuracy under the bag-of-words (BoW) representation. When building a BoW model, choosing feature points that best capture an image's visual information is an important task. A widely held view in scene classification is that a larger number of feature points yields higher classification accuracy, yet this view had not been verified. Within the BoW framework, we conduct extensive experiments to test it, using four feature-selection methods and three variants of the SIFT (Scale-Invariant Feature Transform) descriptor to vary the number of feature points. The results show that the number of feature points markedly affects scene classification accuracy.
With the development of multimedia technology and computer networks, content-based image retrieval (CBIR) has become increasingly important for organizing, indexing and retrieving massive image collections in many application domains, and has emerged as a hot research topic in recent years. Scene classification arose against this background. It automatically annotates images according to a given set of semantic labels, providing effective higher-level contextual information for image understanding tasks such as object recognition. The key difficulty lies in training the computer to understand the semantic content of scenes from a human cognitive perspective, and to recognize the similarities and diversities among scenes of different categories.
     Building on mid-level representations of scenes, our work focuses on how to extract effective visual information from scene images and narrow the well-known semantic gap between low-level visual features and high-level semantic concepts. This thesis makes the following contributions:
     We propose a framework of multiple class-specific visual dictionaries for scene categorization, in which the class-specific dictionaries are constructed using mutual information as the feature-selection criterion. According to each visual word's contribution to classification, the universal visual dictionary is tailored into a class-specific codebook for each category. An image is then characterized by a set of combined histograms, each generated by concatenating the traditional histogram based on the universal codebook with the class-specific histogram based on the class-specific codebook. We also propose a practical adaptive weighting method that lets the traditional and class-specific histograms compete in describing the image. The proposed method provides more effective information to overcome the similarity between images of different categories and improves categorization performance.
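The mutual-information selection step behind the class-specific dictionaries can be sketched as follows. This is a minimal illustration, not the thesis's actual implementation: the counting scheme (document-level presence/absence of a word), the function names, and the toy data are all assumptions.

```python
import math

def mutual_information(n11, n10, n01, n00):
    """MI between a visual word (present/absent) and a class (in/out).
    n11: in-class images containing the word; n10: out-of-class images
    containing it; n01/n00: the analogous counts without the word.
    (Illustrative counting scheme, not the thesis's exact one.)"""
    n = n11 + n10 + n01 + n00
    mi = 0.0
    for nij, nrow, ncol in [
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    ]:
        if nij > 0:
            mi += (nij / n) * math.log2(n * nij / (nrow * ncol))
    return mi

def class_specific_dictionary(word_class_counts, class_sizes, total, k):
    """Keep the k words with highest MI for each class, tailoring the
    universal dictionary into one class-specific codebook per class."""
    dictionaries = {}
    for c, size in class_sizes.items():
        scored = []
        for w, per_class in word_class_counts.items():
            n11 = per_class.get(c, 0)
            n10 = sum(per_class.values()) - n11
            n01 = size - n11
            n00 = total - size - n10
            scored.append((mutual_information(n11, n10, n01, n00), w))
        dictionaries[c] = [w for _, w in sorted(scored, reverse=True)[:k]]
    return dictionaries
```

For example, a word appearing in every image of one class and nowhere else scores maximal MI for that class, while a word spread uniformly across classes scores zero and is dropped from the class-specific codebook.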
     We propose a novel and practical algorithm for scene categorization called the Multi-Scale Multi-Level pLSA model (MSML-pLSA). It consists of two parts: a multi-scale part, where the image is decomposed into multiple scales and diverse visual details are extracted from the layers of different scales to construct the multi-scale histogram, and a multi-level part, where the representations corresponding to different numbers of topics are linearly concatenated to form the multi-level histogram. The model thus represents a scene at varying visual and semantic granularities. By jointly including fine and coarse visual detail, MSML-pLSA creates a more complete representation of the scene, and a comparative study shows the superiority of the proposed method.
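The concatenation that produces the final MSML-pLSA descriptor can be sketched as below. This is a sketch under stated assumptions: the pLSA inference that yields per-topic-count distributions is assumed done upstream, each scale's histogram is L1-normalized, and all names are hypothetical.

```python
def multiscale_histograms(quantized_pyramid, vocab_size):
    """Multi-scale part: one L1-normalized word histogram per scale,
    concatenated. `quantized_pyramid` is a list (one entry per scale)
    of visual-word ids observed at that scale."""
    hists = []
    for words in quantized_pyramid:
        h = [0.0] * vocab_size
        for w in words:
            h[w] += 1
        total = sum(h) or 1.0
        hists.extend(v / total for v in h)
    return hists

def msml_representation(scale_histograms, topic_distributions):
    """Concatenate the multi-scale histograms with the multi-level
    pLSA topic distributions (e.g. P(z|d) for 20, 40, 60 topics)
    into one vector — the multi-scale multi-level histogram."""
    vec = []
    for h in scale_histograms:
        vec.extend(h)
    for dist in topic_distributions:
        vec.extend(dist)
    return vec
```

The design point is simply that fine visual granularity (per-scale histograms) and semantic granularity (per-topic-count distributions) live side by side in one fixed-length vector that a standard classifier can consume.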
     We present a scene categorization approach that learns contextual information without supervision, extending the 'bag of visual words' model to a 'bag of contextual visual words' model. A contextual visual word represents both the local property of the region of interest (ROI) and its contextual properties (from the coarser scale and neighboring regions) simultaneously. By considering the ROI's contextual information, the contextual visual word gives a richer representation of the scene image, reducing ambiguities and errors.
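One way to assemble such a contextual visual word is sketched below, assuming a per-scale grid of already-quantized local words and a coarser scale at half the resolution; the 4-neighborhood, the tuple encoding, and the halving are illustrative choices, not the thesis's specification.

```python
def contextual_word(word_maps, scale, r, c):
    """Build a contextual visual word for the ROI at grid cell (r, c)
    on `scale`: the local word, its 4-neighborhood words, and the word
    of the co-centered region one coarser scale up (assumed to be at
    half resolution). `word_maps[s]` is the word grid at scale s."""
    grid = word_maps[scale]
    local = grid[r][c]
    neighbors = []
    for dr, dc in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
        rr, cc = r + dr, c + dc
        if 0 <= rr < len(grid) and 0 <= cc < len(grid[0]):
            neighbors.append(grid[rr][cc])
    coarse = None
    if scale + 1 < len(word_maps):
        coarse = word_maps[scale + 1][r // 2][c // 2]
    # Sort neighbors so the word is invariant to neighbor ordering.
    return (local, tuple(sorted(neighbors)), coarse)
```

Two ROIs with the same local word but different surroundings then map to different contextual words, which is exactly how the context disambiguates locally identical patches.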
     We study the relationship between the number of interest points and classification accuracy in scene classification. A common belief holds that more interest points produce a higher accuracy rate, but little effort has been made to verify it. To validate this viewpoint, we conduct extensive experiments based on the bag-of-words method. In particular, three different SIFT descriptors and four feature-selection methods are adopted to vary the number of interest points. Experimental results show that the number of interest points significantly affects classification accuracy.
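All of these experiments share the same BoW quantization step: each interest-point descriptor is assigned to its nearest codeword, and the per-image histogram length is fixed while the number of descriptors varies. A minimal sketch of that step, with hypothetical names and a brute-force nearest-centroid search:

```python
def quantize(descriptor, codebook):
    """Assign one descriptor to its nearest codeword (squared
    Euclidean distance; brute force for clarity)."""
    best, best_d = 0, float("inf")
    for i, word in enumerate(codebook):
        d = sum((a - b) ** 2 for a, b in zip(descriptor, word))
        if d < best_d:
            best, best_d = i, d
    return best

def bow_histogram(descriptors, codebook):
    """BoW representation of an image: count how many of its interest
    points fall into each codeword bin. Varying len(descriptors) —
    the knob studied in this chapter — changes the counts but not
    the histogram's dimensionality."""
    h = [0] * len(codebook)
    for d in descriptors:
        h[quantize(d, codebook)] += 1
    return h
```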
