Abstract
Text classification is an important technique for processing text data, so improving the efficiency of text classification is of considerable significance. Traditional text classification computes feature weights with the TFIDF algorithm, which does not consider how feature items are distributed among categories, and this degrades classification performance. This paper proposes an improved TFIDF and integrates it with the Labeled-LDA model. Building on text classification comparison experiments, it proposes a classification method based on mixed features. The experiments show that the method significantly improves the F value of text classification, which demonstrates the effectiveness of the improved method.
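The limitation named in the abstract can be illustrated with a minimal sketch of the classic TFIDF weight, assuming the common form tf × log(N / df). The paper's improved weighting and its Labeled-LDA integration are not specified here and are not reproduced. The function name `tfidf` and the toy corpus are hypothetical and serve only as an illustration.

```python
import math
from collections import Counter

def tfidf(docs):
    """Classic TF-IDF: tf(t, d) * log(N / df(t)). Class labels are never used."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical toy corpus: two "sports" documents and two "finance" documents.
docs = [
    ["goal", "match", "team"],    # sports
    ["goal", "team", "coach"],    # sports
    ["stock", "market", "team"],  # finance
    ["stock", "match", "bank"],   # finance
]
weights = tfidf(docs)

# "goal" occurs only in sports documents, while "match" is split across both
# categories. Both have df = 2, so the classic weight cannot tell them apart.
print(weights[0]["goal"], weights[0]["match"])
```

In this sketch, a term concentrated in one category and a term spread evenly across categories receive the same weight, because only the document frequency enters the formula.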