A Text Clustering Approach of Chinese News Based on Neural Network Language Model

Authors: Zhaoxin Fan; Shuoying Chen; Li Zha…
Keywords: Data mining; Fuzzy k-means; Language model; Chinese news
Journal: International Journal of Parallel Programming
Publication year: 2016
Publication date: February 2016
Year: 2016
Volume: 44
Issue: 1
Pages: 198-206
Full-text size: 391 KB
Author affiliations: Zhaoxin Fan (1)
Shuoying Chen (1)
Li Zha (2)
Jiadong Yang (3)

1. School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
2. Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
3. Sohu.com Inc, Beijing, China
Journal category: Computer Science
Journal subjects: Theory of Computation; Processor Architectures; Software Engineering, Programming and Operating Systems
Publisher: Springer Netherlands
ISSN: 1573-7640

Abstract

Text clustering plays an important role in data mining and machine learning. After years of development, clustering technology has produced a series of theories and methods. However, for text clustering of Chinese news, the mainstream latent Dirichlet allocation (LDA) method suffers from high time complexity. To improve speed, this paper proposes a new method that applies a neural network language model to text clustering for the first time. Text clustering is first converted to its dual problem, word clustering. The neural network language model yields word vectors, which are then clustered with fuzzy k-means over the keyword set of the Chinese news. From the keyword clustering result, the text clustering result for the news is obtained by a single transition. Experiments show that this method runs five times faster than LDA. The method has been successfully deployed in the Sohu news recommendation system.
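The pipeline described in the abstract (word vectors, fuzzy k-means over keywords, then a step from keyword clusters to text clusters) can be illustrated with a minimal sketch. This is not the authors' implementation: the two-dimensional keyword vectors below are made-up stand-ins for vectors that a neural network language model would produce, the farthest-first initialisation is a choice made here for determinism, and averaging keyword memberships per document is only one plausible reading of the "single transition", which the abstract does not specify. All names (`vectors`, `docs`, `fuzzy_kmeans`, `doc_membership`) are hypothetical.

```python
import math

def membership_row(point, centers, m):
    """Fuzzy membership of one point in every cluster (row sums to 1)."""
    dists = [max(math.dist(point, c), 1e-12) for c in centers]
    inv = [d ** (-2.0 / (m - 1.0)) for d in dists]
    total = sum(inv)
    return [v / total for v in inv]

def fuzzy_kmeans(points, k, m=2.0, iters=100, tol=1e-6):
    """Fuzzy k-means (fuzzy c-means) with fuzzifier m > 1."""
    # Assumption: deterministic farthest-first initialisation of the centers.
    centers = [list(points[0])]
    while len(centers) < k:
        far = max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(list(far))
    dim = len(points[0])
    for _ in range(iters):
        u = [membership_row(p, centers, m) for p in points]
        # Each center becomes the mean of all points weighted by u_ij ** m.
        new_centers = []
        for j in range(k):
            w = [row[j] ** m for row in u]
            total = sum(w)
            new_centers.append(
                [sum(wi * p[d] for wi, p in zip(w, points)) / total
                 for d in range(dim)]
            )
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    u = [membership_row(p, centers, m) for p in points]
    return centers, u

# Hypothetical keyword vectors (stand-ins for language-model word vectors).
vectors = {
    "football": [0.90, 0.10], "basketball": [0.80, 0.20], "league": [0.85, 0.15],
    "stock": [0.10, 0.90], "bank": [0.20, 0.80], "market": [0.15, 0.85],
}
keywords = list(vectors)
centers, u = fuzzy_kmeans([vectors[w] for w in keywords], k=2)
word_membership = dict(zip(keywords, u))
word_labels = {w: row.index(max(row)) for w, row in word_membership.items()}

def doc_membership(doc_keywords):
    """Assumed 'transition': average the memberships of a document's keywords."""
    rows = [word_membership[w] for w in doc_keywords]
    return [sum(col) / len(rows) for col in zip(*rows)]

# Hypothetical documents, each reduced to its keyword list.
docs = {"doc_a": ["football", "league", "market"], "doc_b": ["stock", "bank"]}
doc_labels = {}
for name, kws in docs.items():
    scores = doc_membership(kws)
    doc_labels[name] = scores.index(max(scores))
print(doc_labels)
```

Because clustering runs over the keyword set rather than over every document, its cost depends on vocabulary size; each document is then labelled in one pass over its keywords. This is consistent with the speed motivation given in the abstract, though the sketch makes no claim about the reported speedup.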
