Classification of Chinese Texts Based on Recognition of Semantic Topics

详细信息查看全文

作者：Ye-wang Chen ; Qing Zhou ; Wei Luo ; Ji-Xiang Du
关键词：BaiduBaike ; Semantics topic ; Geometric programming ; Chinese text classification
刊名：Cognitive Computation
出版年：2016
出版时间：February 2016
年：2016
卷：8
期：1
页码：114-124
全文大小：1,153 KB
参考文献：1.Hu W, Wu O, Chen Z, Fu Z. Maybank, Steve Nat. Recognition of Pornographic Web Pages by Classifying Texts and Images. IEEE Trans Pattern Anal Mach Intell. 2007;29(6):1019–34.CrossRef PubMed
2.Sebastiani F. Machine learning in automated text categorization. ACM Comput Surv. 2002;34(1):1–47.CrossRef
3.Jin-Shu S, Bo-Feng Z, Xin X. Advances in machine learning based text categorization. J Softw. 2006;17(9):1848–59.CrossRef
4.HP Zhang, HK Yu, DY Xiong, Q Liu. HHMM-based Chinese lexical analyzer ICTCLAS. Second SIGHAN workshop affiliated with 41th ACL; Sapporo Japan, July; 2003. pp 184–7.
5.Chen YW, Wang HZ, Li HB, Zhong BN, Gou J, Chen DS. A topic extraction method for Chinese web text based on BaiduBaike and text classification. J Chin Comput Syst. 2012;33(12):2605–10.
6.T Hofmann, Probabilistic latent semantic indexing. Proceedings of the twenty-second annual. International SIGIR conference on research and development in information retrieval (SIGIR-99); 1999.
7.Blei DM, Ng AY, Jordan MI. Latent Dirichlet allocation. J Mach Learn Res. 2003;3:993–1022.
8.Zhuang FZ, Luo P, Shen ZY, He Q, Xiong Y, Shi ZZ, Xiong H. Mining distinction and commonality across multiple domains using generative model for text classification. IEEE Trans Knowl Data Eng. 2012;24(11):2025–39.CrossRef
9.Gong Z, Zhang D, Hu M. An Improved SVM algorithm for Chinese text classification. Comput Simul. 2009;7:040.
10.J He, AH Tan, CL Tan. A comparative study on Chinese text categorization methods. In PRICAI workshop on text and web mining, vol. 35; 2000.
11.X. Wan. Co-training for cross-lingual sentiment classification. In 4th international.
12.Joint Conference on Natural Language Processing. Association for Computational Linguistics; 2009. P. 235–43.
13.R Pandarachalil, S Sendhilkumar, GS Mahalakshmi. Twitter sentiment analysis for large-scale data: an unsupervised approach. Cogn Comput. 2014(4).
14.Das D, Bandyopadhyay S. Sentence-level emotion and valence tagging. Cogn Comput. 2012;4:420–35.CrossRef
15.Yazdani M, Popescu-Belisa A. Computing text semantic relatedness using the contents and links of a hypertext encyclopedia. Artif Intell. 2013;194:176–202.CrossRef
16.C Huang, H Zhao. Which is essential for Chinese word segmentation: character versus word. In Proceedings of the 20th Pacific Asia conference on language, information and computation (PACLIC20); 2006. p. 1–12.
17.Huang C, Zhao H. Chinese word segmentation: a decade review. J Chin Inf Process. 2007;21(3):8–18.
18.Xia YQ, Wong KF, Zhang P. Toward anomalous and dynamic nature of the Chinese network chat language. J Chin Inf Process. 2007;21(3):83–91.
19.Jian YY, Li P, Wang Q. An improved labeled latent Dirichlet Allocation model for multi-label classification. J Nanjing Univ Nat Sci Ed. 2013;49(4):425–32.
20.Li WB, Sun L, Zhang DK. Text classification based on labeled-LDA model. Chin J Comput. 2008;31(4):621–7.
21.Song SL, Wang SL, Chen P. Chinese text semantic representation for text classification. J Xidian Univ. 2013;40(2):89–97.
22.TS Teng. study on Chinese short-text classification. Master degree thesis of Tsinghua University; 2009.
作者单位：Ye-wang Chen (1)
Qing Zhou (1)
Wei Luo (1)
Ji-Xiang Du (1)

1. Xiamen, China
刊物主题：Neurosciences; Computation by Abstract Devices; Artificial Intelligence (incl. Robotics); Computational Biology/Bioinformatics;
出版者：Springer US
ISSN：1866-9964

文摘

For machine learning methods, processing and understanding Chinese texts are difficult, for that the basic unit of Chinese texts is not character but phrases, and there is no natural delimiter in Chinese texts to separate the phrases. The processing of a large number of Chinese Web texts is more difficult, because such texts are often less topic focused, short, irregular, sparse, and lacking in context. It poses a challenge for mining, clustering, and classification of Chinese Web texts. Typically, the recognition accuracy of the real meaning of such texts is low. In this paper, we propose a method that recognizes stable and abstract semantic topics that express the highly hierarchical relationship behind the Chinese texts from BaiduBaike. Then, based on these semantic topics, a discrete distribution model is established to convert analysis to a convex optimization problem by geometric programming. Our experiments demonstrated that the proposed approach outperforms many conventional machine learning methods, such as KNN, SVM, WIKI, CRFs, and LDA, regarding the recognition of mini training data and short Chinese Web texts. Keywords BaiduBaike Semantics topic Geometric programming Chinese text classification

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700