摘要
复杂数据对象(如图片、文本)通常被表示成高维特征向量。PostgreSQL系统现有的最近邻检索方法KNN-Gist基于树状索引实现,无法高效支持高维数据的最近邻检索。引入的PostgreSQL系统高维空间近似最近邻检索插件:AKNN-Qalsh,基于位置敏感哈希机制实现,支持大规模、高维数据对象的近似最近邻检索。通过在五个真实数据集上的密集实验,验证了该插件的有效性。
Complex data objects(such as pictures, text) are usually represented as high-dimensional feature vectors. The existing nearest neighbor search method KNN-Gist in PostgreSQL is based on the tree-structured index and cannot efficiently support the nearest neighbor search of high-dimensional data. The PostgreSQL system high-dimensional approximate nearest neighbor search extension: AKNN-Qalsh is introduced, which is based on the Locality-Sensitive Hashing scheme and supports approximate nearest neighbor search of large-scale, high-dimensional data objects. The effectiveness of the extension via extensive experiments on five real data sets is demonstrated.
引文
[1] The PostgreSQL Global Development Group.PostgreSQL 9.1 Press Kit [EB/OL].(2011-09-12) [2018-5-14].https://www.postgresql.org/about/press/presskit91/.
[2] KOROTKOV A.ImgSmlr [CP/OL].(2018-02-24) [2018-5-14].https://github.com/ postgrespro/ imgsmlr.
[3] LOWE D G.Distinctive image features from scale-invariant keypoints [J].International journal of computer vision,2004,60(2):91-110.
[4] PostgreSQL 9.6.9 Documentation.Appendix F.Additional Supplied Modules [EB/OL].[2018-5-14].https://www.postgresql.org/docs/9.6/static/pgtrgm.html.
[5] WEBER R,SCHEK H J,BLOTT S.A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces [C]//VLDB.1998,98:194-205.
[6] INDYK P,MOTWANI R.Approximate nearest neighbors:towards removing the curse of dimensionality [C]//Proceedings of the thirtieth annual ACM symposium on Theory of computing.ACM,1998:604-613.
[7] PANIGRAHY R.Entropy based nearest neighbor search in high dimensions [C]//Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithm.Society for Industrial and Applied Mathematics,2006:1186-1195.
[8] LV Q,JOSEPHSON W,WANG Z,et al.Multi-probe LSH:efficient indexing for high-dimensional similarity search [C]//Proceedings of the 33rd international conference on Very large data bases.VLDB Endowment,2007:950-961.
[9] TAO Y,YI K,SHENG C,et al.Efficient and accurate nearest neighbor and closest pair search in high-dimensional space [J].ACM TODS,2010,35(3):20.
[10] GAN J,FENG J,FANG Q,et al.Locality-sensitive hashing scheme based on dynamic collision counting [C]//Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data.ACM,2012:541-552.
[11] HUANG Q,FENG J,ZHANG Y,et al.Query-aware locality-sensitive hashing for approximate nearest neighbor search [J].Proceedings of the VLDB Endowment,2015,9(1):1-12.
[12] HUANG Q,FENG J,FANG Q,et al.Query-aware locality-sensitive hashing scheme for lp norm [J].The VLDB Journal,2017,26(5):683-708.
[13] LEHMAN P L.Efficient locking for concurrent operations on B-trees [J].ACM TODS,1981,6(4):650-670.
(1)http://yann.lecun.com/exdb/mnist
(2)http://corpus-texmex.irisa.fr/
(3)http://labelme.csail.mit.edu/Release3.0/
(4)http://archive.ics.uci.edu/ml/datasets/P53+Mutants