A Text Clustering Approach of Chinese News Based on Neural Network Language Model
  • Authors: Zhaoxin Fan; Shuoying Chen; Li Zha…
  • Keywords: Data mining; Fuzzy k-means; Language model; Chinese news
  • Journal: International Journal of Parallel Programming
  • Publication year: 2016
  • Publication date: February 2016
  • Volume: 44
  • Issue: 1
  • Pages: 198-206
  • Full-text size: 391 KB
  • Author affiliations: Zhaoxin Fan (1)
    Shuoying Chen (1)
    Li Zha (2)
    Jiadong Yang (3)

    1. School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
    2. Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
    3. Sohu.com Inc, Beijing, China
  • Journal category: Computer Science
  • Journal subjects: Theory of Computation
    Processor Architectures
    Software Engineering, Programming and Operating Systems
  • Publisher: Springer Netherlands
  • ISSN: 1573-7640
Abstract
Text clustering plays an important role in data mining and machine learning. After years of development, clustering technology has produced a series of theories and methods. However, for the text clustering of Chinese news, the mainstream LDA method suffers from high time complexity. To improve speed, this paper proposes a new method in which a neural network language model is applied to text clustering for the first time. Text clustering is first converted to its dual problem, word clustering. The neural network language model yields word vectors, which are then used for fuzzy k-means clustering of the Chinese news keyword set. From the keyword clustering result, the text clustering result for Chinese news is obtained by a single transition. Experiments show that this method runs five times faster than LDA. The method is currently used in the Sohu news recommendation system.
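
The pipeline outlined in the abstract (word vectors, fuzzy k-means over keywords, then a single step from keyword clusters to text clusters) can be illustrated with a minimal sketch. Everything concrete here is an assumption for illustration and not the authors' implementation: the toy 2-D vectors stand in for neural network language model word vectors, the English keywords and documents are hypothetical, the fuzzifier m = 2 is a common default, and averaging keyword memberships per document is one plausible reading of the "single transition".

```python
import numpy as np

def fuzzy_kmeans(X, k, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Fuzzy k-means (fuzzy c-means). Returns (centers, memberships)."""
    rng = np.random.default_rng(seed)
    # Random initial memberships; each row sums to 1.
    U = rng.random((X.shape[0], k))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        W = U ** m
        # Each center is the membership-weighted mean of all points.
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        # Distance from every point to every center, shape (n, k).
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        dist = np.maximum(dist, 1e-12)  # avoid division by zero
        # Membership update: u_ij proportional to d_ij^(-2 / (m - 1)).
        inv = dist ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return centers, U

def cluster_documents(doc_keywords, vocab, U):
    """Assumed transition step: average the fuzzy memberships of a
    document's keywords and pick the strongest cluster (-1 if none known)."""
    index = {w: i for i, w in enumerate(vocab)}
    labels = []
    for words in doc_keywords:
        rows = [index[w] for w in words if w in index]
        if not rows:
            labels.append(-1)
            continue
        labels.append(int(U[rows].mean(axis=0).argmax()))
    return labels

# Hypothetical keywords with toy vectors standing in for learned word vectors.
vocab = ["football", "goal", "match", "stock", "market", "bank"]
vectors = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])

centers, U = fuzzy_kmeans(vectors, k=2)

# Hypothetical documents, each represented only by its keyword set.
docs = [["football", "goal"], ["match", "goal"],
        ["stock", "bank"], ["market", "stock"]]
labels = cluster_documents(docs, vocab, U)
print(labels)
```

Because clustering runs over the keyword vocabulary instead of over every document, the expensive step scales with the number of keywords, which is consistent with the speed motivation stated in the abstract. The cluster indices themselves are arbitrary; only the grouping is meaningful.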
