Text categorization for a comprehensive time-dependent benchmark

Authors: Damerau, Fred J.; Zhang, Tong; Weiss, Sholom M.; Indurkhya, Nitin
Keywords: Text categorization; Machine learning; Scalability; Benchmark; Classification of very large corpora
Journal: Information Processing and Management
Publication year: 2004
Publication date: March 2004
Volume: 40
Issue: 2
Pages: 209-221
Full-text size: 289 K

Abstract

We present results for automated text categorization of the Reuters-810000 collection of news stories. Our experiments use the entire one-year collection of 810,000 stories and the entire subject index. We divide the data into monthly groups and provide an initial benchmark of text categorization performance on the complete collection. Experimental results show that efficient sparse-feature implementations of linear methods and decision trees, using a global unstemmed dictionary, can readily handle applications of this size. Predictive performance is approximately as strong as the best results for the much smaller older Reuters collections. Detailed results are provided over time periods. It is shown that a smaller time horizon does not appreciably diminish predictive quality, implying reduced demands for retraining when sample size is large.
