Virtual relevant documents in text categorization with support vector machines

Authors: Kyung-Soon Lee; Kyo Kageura
Keywords: Virtual document; Prior knowledge; Topical representation; Text categorization; Support vectors
Journal: Information Processing and Management
Publication year: 2007
Publication date: July 2007
Volume: 43
Issue: 4
Pages: 902-913
Full-text size: 868 K

Abstract

This paper explores incorporating prior knowledge into support vector machines to compensate for a shortage of training data in text categorization. The prior knowledge, which concerns transformation invariance, is generated by a virtual document method. The method applies a simple transformation to documents: it creates virtual documents by combining pairs of relevant documents for a topic in the training set. A virtual document created this way is expected not only to preserve the topic but also to improve the topical representation by exploiting relevant terms that are not given high importance in individual real documents. The artificially generated documents change the distribution of the training data without randomization. Experiments with support vector machines using linear, polynomial, and radial basis function kernels showed that the method is effective on the Reuters-21578 collection for topics with a small number of relevant documents. The proposed method achieved improvements of 131%, 34%, and 12% in micro-averaged F₁ for 25, 46, and 58 topics with fewer than 10, 30, and 50 relevant training documents, respectively. Analysis of the results indicates that incorporating virtual documents contributes to a steady improvement in performance.
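
The pairing step described in the abstract can be illustrated with a minimal sketch. The abstract does not say how a pair is combined, so the code below assumes that "combining" means summing the term counts of the two documents (equivalent to concatenating their text). The function name `make_virtual_documents` and the toy documents are hypothetical and do not come from the paper.

```python
from collections import Counter
from itertools import combinations


def make_virtual_documents(relevant_docs):
    """Build one virtual document per pair of relevant documents.

    Assumption (not confirmed by the abstract): a pair is combined by
    summing term counts, i.e. concatenating the two token lists.
    """
    virtual = []
    for doc_a, doc_b in combinations(relevant_docs, 2):
        virtual.append(Counter(doc_a) + Counter(doc_b))
    return virtual


# Toy tokenized documents that are all relevant to one hypothetical topic.
relevant_docs = [
    ["wheat", "export", "tonnes"],
    ["wheat", "harvest", "grain"],
    ["grain", "export", "price"],
]

virtual_docs = make_virtual_documents(relevant_docs)

# Positive training examples for the topic: real documents plus virtual ones.
augmented_positives = [Counter(doc) for doc in relevant_docs] + virtual_docs
```

With n relevant documents, this pairing yields n(n-1)/2 virtual documents, so the set of positive examples grows fastest, relative to its original size, for exactly the small-n topics the abstract targets. The resulting term-count vectors would still need weighting and vectorization before being passed to an SVM.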
