Short Text Hashing Improved by Integrating Multi-granularity Topics and Tags
  • Authors: Jiaming Xu (14)
    Bo Xu (14)
    Guanhua Tian (14)
    Jun Zhao (14)
    Fangyuan Wang (14)
    Hongwei Hao (14)

    14. Institute of Automation, Chinese Academy of Sciences, 100190, Beijing, P.R. China
  • Keywords: Similarity Search; Hashing; Topic Features; Short Text
  • Publication: Lecture Notes in Computer Science
  • Year: 2015
  • Volume: 9041
  • Issue: 1
  • Pages: 444-455
  • Full-text size: 358 KB
  • Book title: Computational Linguistics and Intelligent Text Processing
  • ISBN: 978-3-319-18110-3
  • Publication category: Computer Science
  • Publication subjects: Artificial Intelligence and Robotics
    Computer Communication Networks
    Software Engineering
    Data Encryption
    Database Management
    Computation by Abstract Devices
    Algorithm Analysis and Problem Complexity
  • Publisher: Springer Berlin / Heidelberg
  • ISSN: 1611-3349
Abstract
Owing to the computational and storage efficiency of compact binary codes, hashing has been widely used for large-scale similarity search. Unfortunately, many existing hashing methods based on observed keyword features are not effective for short texts, because such texts are sparse and short. Recently, some researchers have tried to use latent topics of a certain granularity to preserve semantic similarity in hash codes beyond keyword matching. However, topics of a single granularity are not adequate to represent the intrinsic semantic information. In this paper, we present a novel unified approach for short text Hashing using Multi-granularity Topics and Tags, dubbed HMTT. In particular, we propose a selection method to choose the optimal multi-granularity topics depending on the type of dataset, and design two distinct hashing strategies to incorporate multi-granularity topics. We also propose a simple and effective method to exploit tags to enhance the similarity of related texts. We carry out extensive experiments on one short text dataset and on one normal text dataset. The results demonstrate that our approach is effective and significantly outperforms baselines on several evaluation metrics.
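
The general idea of similarity search over compact binary codes, which the abstract takes as its starting point, can be illustrated with a minimal sketch. This is not the HMTT method: it swaps in plain random-hyperplane hashing, and the function names, the toy "topic" vectors, and the 16-bit code length are all assumptions made for illustration only.

```python
import random


def make_hyperplanes(n_bits, dim, seed=0):
    """Draw n_bits random Gaussian hyperplanes in a dim-dimensional space."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def hash_code(vec, planes):
    """Pack one bit per hyperplane (1 if the dot product is >= 0) into an int."""
    code = 0
    for plane in planes:
        dot = sum(w * x for w, x in zip(plane, vec))
        code = (code << 1) | (1 if dot >= 0 else 0)
    return code


def hamming(a, b):
    """Number of differing bits between two integer codes."""
    return bin(a ^ b).count("1")


def search(query_code, codes, k=1):
    """Return the indices of the k stored codes closest in Hamming distance."""
    order = sorted(range(len(codes)), key=lambda i: hamming(query_code, codes[i]))
    return order[:k]


# Hypothetical dense feature vectors (e.g. topic proportions) for four texts.
docs = [
    [0.9, 0.1, 0.0, 0.0],
    [0.8, 0.2, 0.0, 0.0],
    [0.0, 0.0, 0.7, 0.3],
    [0.0, 0.1, 0.2, 0.7],
]
planes = make_hyperplanes(n_bits=16, dim=4)
codes = [hash_code(d, planes) for d in docs]
nearest = search(hash_code([0.85, 0.15, 0.0, 0.0], planes), codes, k=2)
```

Each text is stored as a single 16-bit integer, and a query is answered with XOR and bit counting only, which is the efficiency argument the abstract refers to. How the feature vectors are built (for example from topics of several granularities, or from tags) is exactly what the paper addresses and is left out of this sketch.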
