A Document Modeling Method Based on Deep Generative Model and Spectral Hashing


Keywords: Spectral hashing; Document modeling; Deep generative model; Hamming distance; Codeword
Publication title: Lecture Notes in Computer Science
Publication year: 2016
Volume: 9983
Issue: 1
Pages: 402-413
Full-text size: 2,811 KB
Author affiliations: Hong Chen (15)
Jungang Xu (15)
Qi Wang (15)
Ben He (15)

15. University of Chinese Academy of Sciences, Beijing, China
Series title: Knowledge Science, Engineering and Management
ISBN: 978-3-319-47650-6
Publication category: Computer Science
Publication subjects: Artificial Intelligence and Robotics
Computer Communication Networks
Software Engineering
Data Encryption
Database Management
Computation by Abstract Devices
Algorithm Analysis and Problem Complexity
Publisher: Springer Berlin / Heidelberg
ISSN: 1611-3349
Volume order: 9983

Abstract

One of the most critical challenges in document modeling is extracting high-level representations efficiently. This paper proposes a document modeling method based on a deep generative model and spectral hashing. First, dense, low-dimensional features are learned by a deep generative model that takes word-count vectors as its input. These features are then used to train a spectral hashing model that compresses a new document into a compact binary code, such that the Hamming distances between codewords correlate with semantic similarity. Retrieving similar neighbors then reduces to retrieving all items whose codewords lie within a small Hamming distance of the query's codeword. This lookup is exceedingly fast, outperforms conventional methods, and remains applicable to large-scale datasets.
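
The retrieval step described in the abstract can be illustrated with a minimal sketch. It assumes each document has already been compressed to a binary codeword, stored here as a Python integer; the 8-bit codes, the document ids, and the helper names `hamming_distance` and `retrieve_within_radius` are hypothetical and are not taken from the paper. The sketch covers only the Hamming-distance lookup, not the deep generative model or the spectral hashing training.

```python
def hamming_distance(a, b):
    """Count the bit positions in which two codewords differ."""
    return bin(a ^ b).count("1")


def retrieve_within_radius(query, codewords, radius):
    """Return ids of items whose codeword is within `radius` bits of `query`.

    Results are ordered by increasing Hamming distance (ties broken by id).
    """
    hits = [(hamming_distance(query, code), doc_id)
            for doc_id, code in codewords.items()]
    return [doc_id for dist, doc_id in sorted(hits) if dist <= radius]


# Hypothetical 8-bit codewords, invented purely for illustration.
codewords = {
    "doc_a": 0b10110010,
    "doc_b": 0b10110011,  # differs from doc_a in 1 bit
    "doc_c": 0b01001101,  # differs from doc_a in all 8 bits
    "doc_d": 0b10100010,  # differs from doc_a in 1 bit
}

query = 0b10110010
neighbors = retrieve_within_radius(query, codewords, radius=1)
print(neighbors)  # ['doc_a', 'doc_b', 'doc_d']
```

This version scans every codeword linearly, which is enough to show the idea. A system aiming for the speed the abstract describes would more plausibly use the codeword as a direct address into a hash table and probe nearby addresses, but that design is an assumption here, not something the record states.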
