A LDA-Based Algorithm for Length-Aware Text Clustering
  • Authors: Xinhuan Chen (19)
    Yong Zhang (19)
    Yanshen Yin (19)
    Chao Li (19)
    Chunxiao Xing (19)
  • Keywords: text clustering; topic model; K-means; unsupervised learning
  • Publication: Lecture Notes in Computer Science
  • Year: 2014
  • Volume: 8709
  • Issue: 1
  • Pages: 503-510
  • Full-text size: 288 KB
  • Author affiliations:
    19. Research Institute of Information Technology, Tsinghua National Laboratory for Information Science and Technology, Department of Computer Science and Technology, Tsinghua University, Beijing, 100084, China
  • ISSN: 1611-3349
Abstract
The proliferation of text on the Web presents great challenges for knowledge discovery in text collections. Clustering is a powerful tool for organizing information and recognizing its structure. Most text clustering techniques are designed to handle either long texts or short texts. However, many real-life collections contain both, that is, mixed-length texts. Current text clustering techniques are unsatisfactory for such collections because they do not account for the differing sparseness and high dimensionality of mixed-length texts. In this paper, we propose a novel approach, Length-Aware Dual Latent Dirichlet Allocation (ADLDA), which clusters mixed-length texts by obtaining auxiliary knowledge from long texts for short texts, and from short texts for long texts, within the same collection. The degree of this mutual assistance depends on the ratio of long texts to short texts in the corpus. Experimental results on real datasets show that our approach outperforms other state-of-the-art text clustering approaches on mixed-length texts.
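
The abstract says only that the degree of mutual assistance depends on the ratio of long to short texts; it does not give the formula or say how a text is classed as long or short. As a rough illustration of that one idea, and not the authors' ADLDA procedure, the sketch below splits a corpus by whitespace token count and reports the fraction of long texts. The function names, the `threshold` parameter, and its default value are all hypothetical.

```python
from typing import List, Tuple


def split_by_length(docs: List[str], threshold: int = 50) -> Tuple[List[str], List[str]]:
    """Partition docs into (long_docs, short_docs) by whitespace token count.

    ASSUMPTION: a fixed token threshold separates long from short texts.
    The abstract does not state how the paper draws this line.
    """
    long_docs = [d for d in docs if len(d.split()) >= threshold]
    short_docs = [d for d in docs if len(d.split()) < threshold]
    return long_docs, short_docs


def long_text_fraction(docs: List[str], threshold: int = 50) -> float:
    """Return the share of documents classed as long (a value in [0, 1]).

    ASSUMPTION: this fraction is only a stand-in for the long/short ratio
    mentioned in the abstract; the paper's actual weighting is not given there.
    """
    if not docs:
        raise ValueError("corpus is empty")
    long_docs, _ = split_by_length(docs, threshold)
    return len(long_docs) / len(docs)
```

A corpus dominated by short texts yields a fraction near 0 and one dominated by long texts a fraction near 1; how such a value would feed into the topic model is left open here because the abstract does not specify it.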
