摘要
从海量数据中进行近似数据的检索是数据挖掘领域许多应用的关键。尤其近年来,数据的规模出现爆炸式增长,数据检索需面对海量数据和"维度灾难"的叠加考验,这使得传统最近邻算法效率降低,而近似最近邻算法发挥了越来越重要的作用。其中哈希算法以其在存储空间和计算时间上的优势受到了广泛关注。提出了一种基于随机森林的哈希算法。该算法通过构建随机森林,将原始空间的样本映射为海明空间的二进制哈希码,并在哈希空间上定义了顺序敏感的海明距离,以最大程度保持数据在原空间的近邻关系不变。由于随机森林中不同决策树所使用的特征空间和学习过程是独立的,可以以增量的方式灵活地确定哈希码的长度。此外基于随机森林的哈希编码算法天然适合并行部署,从而可以大大提高算法速度。最后,在MNIST和CIFAR-10数据集对所提算法进行了实验验证,结果表明了算法的有效性和出色性能。
The retrieval of approximate data from massive data is the key to many applications in data mining.Especially in the recent years, the scale of data has exploded, and data retrieval needs to face the overlapping test of massive data and"dimension disaster", which makes the efficiency of traditional nearest neighbor algorithm decrease,while the approximate nearest neighbor algorithms are playing an increasingly important role. Among them, the Hashing algorithm has received extensive attention due to its advantages in storage space and computation time.This paper proposes a Hashing algorithm based on random forest. In the method, by constructing a random forest,the sample of the original space is mapped to the binary Hash code of Hamming space, and an order-sensitive Hamming distance is defined to keep the neighbor relationship of the data in the original space to the maximum extent. Since the feature space and learning process used by different decision trees in a random forest are independent,the length of the Hash code can be flexibly determined in an incremental manner. In addition, the proposed algorithm based on random forest is naturally suitable for parallel deployment, which can greatly improve the speed of the algorithm. Finally, the proposed algorithm is experimentally verified in the MNIST and CIFAR-10 data sets. The results demonstrate the effectiveness and great performance of the algorithm.
引文
[1]Andoni A,Indyk P.Near-optimal Hashing algorithms for approximate nearest neighbor in high dimensions[C]//Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science,Berkeley,Oct 21-24,2006.Washington:IEEE Computer Society,2006:459-468.
[2]Gionis A,Indyk P,Motwani R.Similarity search in high dimensions via Hashing[C]//Proceedings of the 25th International Conference on Very Large Data Bases,Edinburgh,Sep 7-10,1999.San Mateo:Morgan Kaufmann,1999:518-529.
[3]Gong Y C,Lazebnik S,Gordo A,et al.Iterative quantization:a procrustean approach to learning binary codes for largescale image retrieval[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2013,35(12):2916-2929.
[4]Salakhutdinov R,Hinton G.Semantic Hashing[J].International Journal of Approximate Reasoning,2009,50(7):969-978.
[5]Kong W H,Li W J.Isotropic Hashing[C]//Proceedings of the 25th International Conference on Neural Information Processing Systems,Lake Tahoe,Dec 3-6,2012.Red Hook:Curran Associates,2012:1646-1654.
[6]Liu W,Mu C,Kumar S,et al.Discrete graph Hashing[C]//Proceedings of the 27th International Conference on Neural Information Processing Systems,Montreal,Dec 8-13,2014.Cambridge:MIT Press,2014:3419-3427.
[7]Norouzi M,Fleet D J.Minimal loss Hashing for compact binary codes[C]//Proceedings of the 28th International Conference on Machine Learning,Bellevue,Jun 28-Jul 2,2011.Madison:Omnipress,2011:353-360.
[8]Kulis B,Darrell T.Learning to Hash with binary reconstructive embeddings[C]//Proceedings of the 23rd Annual Conference on Neural Information Processing Systems,Vancouver,Dec7-10,2009.Red Hook:Curran Associates,2009:1042-1050.
[9]Shen F M,Shen C H,Liu W,et al.Supervised discrete Hashing[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,Boston,Jun 7-12,2015.Washington:IEEE Computer Society,2015:37-45.
[10]Liu W,Wang J,Ji R R,et al.Supervised Hashing with kernels[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,Providence,Jun 16-21,2012.Washington:IEEE Computer Society,2012:2074-2081.
[11]Wang J,Kumar S,Chang S F.Sequential projection learning for Hashing with compact codes[C]//Proceedings of the27th International Conference on Machine Learning,Haifa,Jun 21-24,2010.Madison:Omnipress,2010:1127-1134.
[12]Ye G N,Liu D,Wang J,et al.Large-scale video Hashing via structure learning[C]//Proceedings of the IEEE International Conference on Computer Vision,Sydney,Dec 1-8,2013.Washington:IEEE Computer Society,2013:2272-2279.
[13]Chum O,Philbin J,Zisserman A.Near duplicate image detection:min-Hash and TF-IDF weighting[C]//Proceedings of the British Machine Vision Conference 2008,Leeds,Sep1-4,2008.Durham:British Machine Vision Association,2008:1-10.
[14]Wu C X,Zhu J K,Zhang J M,et al.A convolutional treelets binary feature approach to fast keypoint recognition[C]//LNCS 7576:Proceedings of 12th European Conference on Computer Vision,Florence,Oct 7-13,2012.Berlin,Heidelberg:Springer,2012:368-382.
[15]Breiman L.Random forests[J].Machine Learning,2001,45(1):5-32.
[16]Liu F T,Ting K M,Yu Y,et al.Spectrum of variablerandom trees[J].Journal of Artificial Intelligence Research,2008,32(1):355-384.
[17]Zhou Z H,Feng J.Deep forest:towards an alternative to deep neural networks[J].arXiv:1702.08835,2017.
[18]Feng J,Zhou Z H.AutoEncoder by forest[J].arXiv:1709.09018,2017.
[19]LeCun Y.The MNIST database of handwritten digits[EB/OL].[2012-01-20].http://yann.lecun.com/exdb/mnist.
[20]Krizhevsky A,Hinton G.Learning multiple layers of features from tiny images[R].Toronto:University of Toronto,2009.
[21]Lin Y.Hashing based algorithms for nearest neighbor search in high dimensions[D].Hangzhou:Zhejiang University,2013.
[22]Li W J,Zhou Z H.Learning to Hash for big data:current status and future trends[J].Chinese Science Bulletin,2015,60(5/6):485-490.
[23]Bentley J L.Multidimensional binary search trees used for associative searching[J].Communications of the ACM,1975,18(9):509-517.
[24]Guttman A.R-trees:a dynamic index structure for spatial searching[C]//Proceedings of the 1984 ACM SIGMOD International Conference on Management of Data,Boston,Jun18-21,1984.New York:ACM,1984:47-57.
[25]Quinlan J R.Induction on decision tree[J].Machine Learning,1986,1(1):81-106.
[26]Quinlan J R.C4.5:programs for machine learning[M].New York:Elsevier Science Inc,2014.
[27]Breiman L I,Friedman J H,Olshen R A,et al.Classification and regression trees(CART)[J].Encyclopedia of Ecology,1984,40(3):582-588.
[28]Yu G,Yuan J S.Scalable forest Hashing for fast similarity search[C]//Proceedings of the IEEE International Conference on Multimedia and Expo,Chengdu,Jul 14-18,2014.Washington:IEEE Computer Society,2014:1-6.
[29]Oliva A,Torralba A.Modeling the shape of the scene:a holistic representation of the spatial envelope[J].International Journal of Computer Vision,2001,42(3):145-175.
[21]林悦.基于哈希算法的高维数据的最近邻检索[D].杭州:浙江大学, 2013.
[22]李武军,周志华.大数据哈希学习:现状与趋势[J].科学通报, 2015, 60(5/6):485-490.