大规模图像库的高维索引技术研究
详细信息    本馆镜像全文|  推荐本文 |  |   获取CNKI官网全文
摘要
高维数据的索引机制是大规模图像库的基于内容检索能够达到实时性要求的关键技术。面临“维度灾难”带来的影响,如何通过索引的表示、索引的组织和索引的提取提高高维图像数据的检索效率是高维索引研究的关键问题。本文主要针对索引的表示和索引的组织进行了研究,提出了一系列简单可行的索引方法。
     在综合研究现有高维索引技术的基础上,针对高维索引表示和组织的关键问题:索引剪枝过滤策略、高维向量近似表示和快速近似检索算法,详细讨论了目前已提出的索引方法的局限性,并设计了改进方法。高维主存索引作为高维索引的未来发展方向,引起了很多学者的关注。对高维主存索引结构的研究可以为基于磁盘的高维索引结构的研究提供新的思路,文中设计了一种新的高维主存索引结构。
     为了减少大规模图像数据库在检索过程中引入的误中点个数,本文在定义向量排序和活性维等概念的基础上,提出了一种新的索引快速剪枝过滤技术。该技术采用分段处理思想,实现非候选节点以序列方式和以点方式的两阶段剪枝过滤,从而快速排除所有的误中点,尽可能减少距离计算次数,实现大规模高维向量空间的快速范围查询。该技术适用于目前已提出的基于一维转换思想的高维索引结构中,如金字塔技术,可以提高这类索引机制的检索效率。
     对数据空间的有效划分是高维向量近似表示的前提,结合近似向量表示和一维转换两种索引构造思想,提出基于位码和距离的高维向量压缩表示形式,实现高维向量的二维表示形式。检索时采用两层过滤技术,可以显著减少检索需要访问的数据向量的个数。实验证明,这种两种索引机制相结合的方法取得了比单独的索引技术更好的性能。
     基于高维向量压缩表示的索引构造思想,提出一种简单有效的KNN检索算法。通过聚类将数据划分成多个子集空间,对每个聚类子集内的高维向量,利用距离和位码定义简化表示形式。KNN搜索时,在不需要计算向量距离的情况下,根据部分维的位码不相同信息的比较,即容易实现的字符串比较,将某些非候选节点迅速过滤,以此减少高维向量距离计算次数。该方法可以大大降低利用索引进行相似性检索的CPU代价,达到快速检索的目的。
     主存技术的不断进步,使得主存多媒体数据库的实现成为可能。研究表明,主存多媒体数据库系统性能深受处理器缓存未命中的影响,缓存感知型主存索引是提高数据检索效率的有效手段。针对SA-Tree不适用于主存存取的缺点,提出它的变体CSA-Tree。CSA-Tree利用PCA降维技术,将树的各层节点采用不同的维度表示,这样不仅提高缓存空间的利用率,还降低了CPU负载,从而提高了索引查询效率。
High dimensional index technique is one significant research issue for content-based image real-time retrieval in large-scale image databases. Facing the influence of the“curse of dimensionality”, how to speed up retrieval in image databases with rational design of the expression、organization and extraction of the index is key problem in high-dimensional indexing schemes. The issues of the expression and organization of high-dimensional index are discussed in this paper and a series of simple and feasible index methods are proposed.
     Based on the research of expression and organization of current index methods, some key issues such as pruning and filtering of index tree, approximation presentation of high-dimensional vector and algorithm for fast similarity retrieval are discussed in detail, especially on their disadvantage, based on which some improvement approaches are proposed. With the development of high-dimensional index, high-dimensional main memory index arouses more and more interest, which techniques will be benefit to the research on the disk-based high-dimensional index. A new high-dimensional main memory index is designed in this paper.
     With large amount of points in the image databases, the retrieval performance degrades dramatically because the cost of distance computation with high dimensions increases greatly with many false hit points brought in the process of the range query. In this dissertation, a fast pruning and filtering technique is proposed based on the basis of the introduced concepts of vector ordering and active dimension. This method can solve two phases of pruning: one is in a sequence way; the other is in a point way, such that it can reduce the range that needed to search. By incorporating it into the Pyramid-Technique (PT), we can improve the retrieval efficiency of the PT.
     Efficient partition of data space is the basis of approximation presentation of a high-dimensional vector. Based on the techniques of approximate vector presentation and one dimensional transformation, an optimal compressive presentation of high-dimensional vector is proposed, called bit-code and distance based index. This representation enables two levels filtering. Experiments on large-scale image databases demonstrate a remarkable reduction of the amount of accessed vectors in similarity searches compared with the existing index schemes.
     To support efficient querying and retrieval on large-scale image databases, we propose a methodology exploring Local Bit-code Difference (LBD). On clustering the data space into a number of partitions, LBD extracts a distance and a simple bitmap representation called Bit Code (BC) for each point in the database with respect to the corresponding reference point. Pruning during KNN search is performed by dynamically selecting only a subset of the bits from the BC based on which subsequent comparisons are performed. In doing so, expensive operations involved in computing L-norm distance functions between high-dimensional data can be avoided.
     In main-memory databases, the number of processor cache misses has a critical impact on the performance of the system. Cache-conscious indices are designed to improve performance by reducing the number of processor cache misses that are incurred during a search operation. Considering the disadvantage of SA-Tree inefficient for main memory access, we present its variant called CSA-Tree, which is a multi-level structure where each level of the tree represents the data space at different dimensionalities by using Principal Component Analysis. Each level of the tree serves to prune the search space more efficiently as the reduced dimensions can better exploit the small cache line size. Moreover, the distance computation on lower dimensionality is less expensive. We conducted extensive experiments to evaluate the proposed structures against other methods. Our results show that the CSA-Tree is superior in most cases.
引文
[1] C. Faloutsos, R. Barber, M. Flickner, J. Hafner. Efficient and effective querying by image content. Journal of Intelligent Information Systems 3, 1994: 231-262
    [2] Y. Rui, T. Huang, S. Chang. Image retrieval: current techniques, promising directions and open issues. Journal of Visual Communication and Image Representation, 1999, 10: 1-23
    [3] T. Seidl, H. -P. Kriegel. Efecient user-adaptable similarity search in large multimedia databases. In Proc. 23rd Int. Conf. on Very Large Databases, Athens, 1997
    [4] Y. X. Chen, Z. W. James, K. Robert. Content-based image retrieval by clustering. IN: International Multimedia Conference Proceedings of the 5th ACM SIGMM international workshop, 2003: 193-200
    [5] A. Taza, C. Y. Suen. Discrimination of planar shapes using shape matrices, IEEE Trans. Syst., Man. Cybern., 1989, 19(5): 1281-1289
    [6] H. Jagadish. A retrieval technique for similar shapes. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 1991: 208-217
    [7] J. Hafner, H. S. Sawhney et. al. Efficient color histogram indexing for quadratic form distance functions. IEEE Trans. on Pattern Analysis and Machine Intelligence, 1995, 17(7): 729-736
    [8] R. Mehrotra, J. Gary. Feature-based retrieval of similar shapes. In Proc. 9th Int. Conf. on Data Engineering, 1993
    [9] R. Mehrotra, J. Gary. Feature-index-based similar shape retrieval. In Proc. 3rd Working Conf. on Visual Database Systems, 1995
    [10] M. Stricker, A. Dimai. Color indexing with weakspatial constraints. IS&T/SPIE Conf. on Storage and Retrieval for Image and Video Databases IV, San Jose, CA: 1996: 29-40
    [11] M. Stricker, M. Orengo. Similarity of color images. IS&T/SPIE Conf. on Storage and Retrieval for Image and Video Databases III, Vol. 2420, San Jose, CA: Feb. 1995: 381-392
    [12] B. Shoichet, D. Bodian, I. Kuntz. Molecular docking using shape descriptors. Journal of Computational Chemistry, 1992, 13(3): 380-397
    [13] W. Hsu, T. Chua, H. Pung. An integrated color-spatial approach to content-based image retrieval. Proc. of the ACM MM Conf., San Francisco, CA: Nov. 1995: 305-313
    [14] M. J. Swain, D. H. Ballard. Color indexing. International Journal of Computer Vision, 1991, 7(1): 11-32
    [15]孟繁杰,郭宝龙. CBIR关键技术研究.计算机应用研究, 2004(7): 21-26
    [16]章毓晋,刘忠伟.基于HSI模型和累积直方图的彩色图像检索.第八届全国信号处理学组委员会联合会议论文集, 1997: 256-260
    [17] M. Flickner. Query by image and video content: The QBIC system. IEEE Computer, 1995, 28(9): 23-32
    [18] J. Smith, S-F. Chang. VisualSEEK: A fully automated content-based image query system. In: Proc of the 4th ACM Multimedia Conference. Boston, 1996: 87-98
    [19] Y. A. Aslandogan, C. T. Yu. Techniques and systems for image and video retrieval. IEEE Transactions on Knowledge and Data Engineering, 1999, 11(1): 56 - 63
    [20] W. Y. Ma, B. S. Manjunath. NeTra: A toolbox for navigating large image databases. Multimedia Systems 1999, 7: 184-198
    [21]李向阳,鲁东明,潘云鹤,基于色彩的图像数据库检索方法的研究.计算机研究与发展, 1999,36(3): 359-363
    [22]曹莉华,柳伟,李国辉.基于多种主色调的图像检索算法研究与实现.计算机研究与发展, 1999, 36(1): 96-100
    [23]吴洪,卢汉清,马颂德.基于内容图像检索中相关反馈技术的回顾.计算机学报, 2005, 12(28): 1769-1779
    [24] B. Shoichet, D. Bodian, I. Kuntz.. Molecular docking using shape descriptors. Journal of Computational Chemistry, 1992, 13(3): 380-397
    [25] S. Altschul, W. Gish, W. Miller, E. Myers, D. Lipman. A basic local alignment search tool. Journal of Molecular Biology, 1990, 215(3): 403-410
    [26] R. Agrawal, C. Faloutsos, A. Swami. Efficient similarity search in sequencedatabases. In Proc. 4th Int. Conf. on Foundations of Data Organization and Algorithms, LNCS 730, 1993: 69-84
    [27] R. Agrawal, K. Lin, H. Sawhney, K. Shim. Fast similarity search in the presence of noise, scaling, and translation in time-series databases. In Proc. 21st Int. Conf. on Very Large Databases, 1995: 490-501
    [28] F. Korn, N. Sidiropoulos, C. Faloutsos, E. Siegel, Z. Protopapas. Fast nearest neighbor search in medical image databases. In Proc. 22nd Int. Conf. on Very Large Databases, Mumbai, India, 1996: 215-226
    [29] H. Shawney, J. Hafner. Efficient color histogram indexing. In Proc. Int. Conf. on Image Processing, 1994: 66-70
    [30] K. Kukich. Techniques for automatically correcting words in text. ACM Computing Surveys, 1992, 24(4): 377-440
    [31] T. Wallace, P. Wintz. An efficient three-dimensional aircraft recognition algorithm using normalized fourier descriptors. Computer Graphics and Image Processing, 1980: 99-126
    [32] A. Bimbo. Visual Information Retrieval. Morgan Kaufmann, Inc., 1999
    [33] J. F. Manuel, A. J. Joaquim. Indexing high-dimensional data for content-based retrieval in large databases. Proceedings of the Eighth International Conference on Database Systems for Advanced Applications (DASFAA’03), 2003: 267-274
    [34] R. H. Gisli, S. Hanan. Index-driven similarity search in metric spaces. ACM Transactions on Database Systems, 2003, 28(4): 517-580
    [35] J. Nievergelt, H. Hinterberger, K. Sevcik. The gridfile: an adaptable symmetric multikey file structure. ACM Transactions on Database Systems, 1984, 9(1): 38-71
    [36] J. Robinson. The k-d-b-tree: a search structure for large multidimensional dynamic indexes. In Proceeding of the ACM SIGMOD International Conference on Management of Data, 1981: 10-18
    [37] R. Finkel, J. Bentley. Quad-trees: a data structure for retrieval on composite keys. ACTA Informatica, 1974, 4(1): 1-9
    [38] A. Guttman. R-trees: A dynamic index structure for spatial searching. In Proc. of the ACM IGMOD Int. Conf. on Management of Data, Boston, MA, June 1984: 47-57
    [39] T. Sellis, N. Roussopoulos, C. Faloustos. The R+-tree: a dynamic index for multi-dimensional objects. In Proc. of the Int. Conference on Very Large Databases, Brighton, England, 1987: 507-518
    [40] N. Beckmann, H. P. Kriegel, R. Schneider, and B. Seeger. The R*-tree: an efficient and robust access method for points and rectangles. In Proceeding of the ACM SIGMOD International Conference on Management of Data, 1990: 322-331
    [41] S. Berchtold, D. Keim, H. -P. Kriegel. The X-tree: an index structure for high-dimensional data. In Proc. of the Int. Conference on Very Large Databases, 1996: 28-39
    [42] N. Katayama, S. Satoh. The SR-tree: an index structure for high-dimensional nearest neighbor queries. In Proceeding of the ACM SIGMOD International Conference on Management of Data, 1997: 369-380
    [43] P. Ciaccia, M. Patella, P. Zezula. M-tree: an efficient access method for similarity search in metric spaces. In Proc. of the Int. Conference on Very Large Databases, Athens, Greece, 1997
    [44] K. -I. Lin, H. Jagadish, C. Faloutsos. The TV-tree: an index structure for high-dimensional data. The VLDB Journal, Oct. 1994, 3(4): 517-549
    [45] D. Lomet. The hB-tree: A multiattribute indexing method with good guaranteed performance. ACM Transactions on Database Systems, December 1990, 15(4): 625-658
    [46]倪巍伟,孙志挥,陆介平.高维空间K领域局部密度聚类算法.计算机研究与发展, 2005, 42(5): 784-791
    [47] S. Berchtold, A. Keim. High-dimensional index structures: Database support for next decade's applications. Tutorial Notes: ACM SIGMOD-98 Conference on Management of Data, Seattle, 1998
    [48] S. Berchtold, C. Bohm, H. -P. Kriegel. The pyramid-technique: towards breaking the curse of dimensionality. In Proceedings of ACM SIGMOD, Seattle, WA, 1998: 142-153
    [49]董道国,刘振中,薛向阳. VA-Trie:一种用于近似K近邻查询的高维索引结构.计算机研究与发展, 2005, 42(12): 2213-2218
    [50] R. Weber, H. Schek, S. Blott. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In Proceeding of ACM VLDB'98, 1998
    [51] K. Beyer, J. Goldstein, R. Ramakrishnan. When is nearest neighbor meaningful?. In Proceeding of the 7th International Conference on Database Theory. Jerusalem, 1999: 217-235
    [52] A. Hinneburgy, C. C. Aggarwalz, D. A. Keimy, What is the nearest neighbor in high-imensional spaces?, in Proc. VLDB, 2000: 506-515
    [53] K. Chakrabarti, S. Mehrotra. Local dimensionality reduction: a new approach to indexing high dimensional spaces. In Proc. Conf. Very Large Data Bases, 2000: 89-100
    [54] H. Jin, B. Ooi, H. Shen, C. Yu, A. Zhou. An adaptive and efficient dimensionality reduction algorithm for high-dimensional indexing. In Proc. Int’l Conf. Data Eng., 2003: 87-98
    [55] K. V. R. Kanth, D. Agrawal, A. Singh. Dimensionality reduction for similarity searching in dynamic databases, in Proc. ACM SIGMOD ICMD, 1998: 166-176
    [56] C. Bin, et al. Exploring bit-difference for approximate KNN search in high-dimensional databases. In Proc. 16th Australasian Database Conference, 2005: 165-174
    [57] Y. Cui, et al. Indexing the distance: an efficient method to KNN processing. In Proc. 27th International Conference on Very Large Data Bases, 2001: 421-430
    [58] N. Roussopoulos, S. Kelley, F. Vincent. Nearest neighbor queries. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 1995: 71-79
    [59] G. Hjaltason, H. Samet. Ranking in spatial databases. In Proc. 4th Int. Symp. on Large Spatial Databases, Portland, ME, 1995: 83-95
    [60] H. Jing, C. Bin, S. Hengtao. Diagonal ordering: a new approach to high-dimensional KNN processing. IN: Fifteenth Australasian Database Conference (ADC2004), 2004: 39-47
    [61] H. Ferhatosmanoglu, E. Tuncel, D. Agrawal. Vector approximation based indexing for non-uniform high dimensional data sets. In ACM International Conference on Information and Knowledge Management (CKIM2000). McLean: ACM Press,2000
    [62] S. Berchtold, D. Keim, H. -P. Kriegel, T. Seidl. Indexing the solution space: a new technique for nearest neighbor search in high-dimensional space. IEEE Trans. on Knowledge and Data Engineering, 2000
    [63] J. Rao, K. Ross. Making B + -Trees cache conscious in main memory. In Proc. of the ACM SIGMOD Conference, Dallas, June, 2000: 194-205
    [64] A. Ailamaki, D. J. DeWitt, M. D. Hill, D. A. Wood. DBMSs on a modern processor: where does time go?In Proc. 25th VLDB Conference, Edinburgh, Sep. 1999: 266-277
    [65] P. A. Boncz, S. Manegold, M. L. Kersten. Database architecture optimized for the new bottleneck: memory access. In Proceedings of the 25th International Conference on Very Large Data Bases, Edinburgh, Scotland, UK, 1999: 54-65
    [66] C. Nyberg, T. Barclay, Z. Cvetanovic, J. Gray, D. B. Lomet. AlphaSort: a cache-sensitive parallel external sort. VLDB, 1995: 603-627
    [67] A. Shatdal, C. Kant, J. F. Naughton. Cache conscious algorithms for relational query processing. In Proceedings of the 20th International Conference on Very Large Data Bases, Santiago de Chile, Chile, 1994: 510-521
    [68] J. Rao, K. A. Ross. Cache conscious indexing for decision-support in main memory. In Pro of the 25th VLDB Conference, Edinburgh, 1999: 78-89
    [69] S. Chen, P. B. Gibbons, T. C. Mowry. Improving index performance through prefetching. In Proc. of the ACM SIGMOD Conference, 2001: 139-150
    [70] P. Bohannom, P. Mcllroy, R. Rastogi Main-memory index structures with fixed-size partial keys. In Proc. of the ACM SIGMOD Conference, Santa Barbara, 2001: 163-174
    [71] K. Kim, S. K. Cha, K. Kwon. Optimizing multidimensional index trees for main memory access. In Proc. of the ACM SIGMOD Conference, 2001: 139-150
    [72] R. Zhang, B. C. Ooi, K. L. Tan, : Making the Pyramid Technique Robust to Query Types and Workloads, ICDE 2004: 313-324
    [73] Y. Cui. Querying high-dimensional data in single-dimensional space. The VLDB Journal, 2002; 1-15
    [74] D. Abel, J. Smith. A data structure and algorithm based on a linear key for a rectangle retrieval problem. Computer Vision, 1983, 24: 1-13
    [75] G. Morton. A computer oriented geodetic data base and a new technique in file sequencing. IBM Ltd., USA., 1966
    [76] J. Orenstein. A comparison of spatial query processing techniques for native and parameter spaces. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 1990: 326-336
    [77] J. Orenstein, T. Merret. A class of data structures for associative searching. In Proc. 3rd ACM SIGACT-SIGMOD Symp. on Priciples of Database Systems, 1984: 181-190
    [78] C. Faloutsos. Multiattribute hashing using gray codes. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 1985: 227-238
    [79] C. Faloutsos. Gray codes for partial match and range queries. IEEE Trans. On Software Engineering, 1988, 14: 1381-1393
    [80] C. Faloutsos, S. Roseman. Fractals for secondary key retrieval. In Proc. 8th ACM SIGACT-SIGMOD Symp. on Principles of Database Systems, 1989: 247-252
    [81] G. -H. Cha, X. Zhu, D. Petkovic, C. -W. Chung. An indexing method for nearest neighbor searches in high-dimensional image databases. IEEE Transactions on Multimedia, 2002, 4(1): 76-87
    [82] Y. Sakurai, M. Yoshikawa, S. Uemura. The a-tree: an index structure for high-dimensional spaces using relative approximation. In VLDB, 2000: 516-526
    [83] Corel Image Features. available from http: //kdd. ics. uci. edu
    [84] G. Navarro, Searching in metric spaces by spatial approximation. In Proceedings String Processing and Information Retrieval and International Workshop on Groupware. Mexico, 1999: 141-148
    [85] I. T. Jolliffe. Principal Component Analysis. Springer-Verlag, 1986
    [86] T. Hastie, W. Stuetzle. Principal curves. J. Am. Stat. Assoc., 1989, 84: 502-516
    [87] M. A. Carreira-Perpinan. A survey of dimension reduction techniques. Technical report, Center for Applied Scientific, Lawrence Livermore National Laboratory, 2002
    [88] M. A. Carreira-Perpinan. A review of dimension reduction techniques. Technicalreport CS-96-09, Department of Computer Science, University of Sheffield, 1997
    [89] A. Hyvarinen. Survey on independent component analysis. Neural Computing Surveys, 1999, 2: 94-128
    [90] J. Karhunen. Nonlinear independent component analysis. In S. Roberts and R. Everson, editors, Independent Component Analaysis: Principles and Practice. Cambridge University Press, Cambridge, UK, 2000
    [91] N. Kambhaltla and T. K. Leen. Fast non-linear dimension reduction. In Advances in Neural Information Processing Systems, 1994: 152-159
    [92] M. Wolf, M. Lam. A data locality optimizing algorithm. In Proceeding of the 1991 ACM Symposium on Programing Languages Design and Implementation, ACM, 1991: 30-44
    [93] R. Uhlig, D. Nagle, T. Stanley, T. Mudge, S. Secjrest, R. Brown. Design tradeoffs for software-managed TLBs. ACM Transactions on Computer Systems, 1994, 12(3): 175-205
    [94] O. Temam, C. Fricker, W. Jakby. Influence of cross-interferences on blocked loops: A case study with matrix-vector multiply. ACM Transactions on Programming Languages and Systems, 1995, 17(4): 561-575
    [95] J. A. Spierenburg. Dimension reduction of images using neural networks. Master's thesis, Leiden University, 1997
    [96] O. Temam, C. Fricker, W. Jakby. Cache interference phenomena. In Proceedings of the 1994 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, 1994: 261-271
    [97] M. Martonosi, A. Gupta, T. Anderson. Tuning memory performance of sequential and parallel programs. Computer, 1995, 28(4): 32-40
    [98] K. Kennedy, K. McKinley. Optimizing for parallelism and data locality. In Proceedings of the 1992 International Conference on Supercomputing, 1992: 323-334
    [99] T. Johnson. Characterizing the performance of algorithms for lock-free objects. IEEE Transactions on Computers, 1995, 44(10): 1194-1207
    [100] D. Goldberg. Genetic algorithms in search, optimization, and machine learning. Addison Wesley, Reading, MA, 1989
    [101] M. L. Raymer et al. Dimensionality reduction using genetic algorithms. IEEE Transactions on Evolutionary Computation, 2000, 4(2): 164-171
    [102] R. A. Hankins, J. M. Patel. Effect of nodes size on the performance of cache-conscious B+-Trees. In Proc. of the SIGMETRICS Conference, 2003: 283-294
    [103] G. Navarro. Searching in metric spaces by spatial approximation. In: Proceedings of String Processing and Information Retrieval and International Workshop on Groupware, Cancun, Mexico, 1999: 141-148
    [104] G.. Navarro, Reyes N.. Dynamic spatial approximation trees. In: Proceedings of the XXI Conference of the Chilean Computer Science Society, Punta Arenas, Chile, 2001: 213-222
    [105] E. Chavez, G. Navarro, R. b. Yates, J. L. Marroquin. Searching in Metric Sapces. ACM Computing Surveys, 2001, 33(3): 273-321
    [106] T. Bozkaya, Ozsoyoglu M.. Distance-based indexing for high-dimensional metric spaces. In: Proceedings of the ACM SIGMOD Conference, Tucson, Arizona, 1997: 357- 368

© 2004-2018 中国地质图书馆版权所有 京ICP备05064691号 京公网安备11010802017129号

地址:北京市海淀区学院路29号 邮编:100083

电话:办公室:(+86 10)66554848;文献借阅、咨询服务、科技查新:66554700