Semantic smoothing for text clustering

Authors: Jamal A. Nasir ; Iraklis Varlamis ; Asim Karim ; George Tsatsaronis
Keywords: Text clustering ; Semantic smoothing kernels ; WordNet ; Wikipedia ; Generalized vector space model kernel
Journal: Knowledge-Based Systems
Publication date: December 2013
Year: 2013
Volume: 54
Issue: Complete
Pages: 216-229

Abstract

In this paper we present a new semantic smoothing vector space kernel (S-VSM) for text document clustering. In the proposed approach, semantic relatedness between words is used to smooth the similarity and the representation of text documents. The basic hypothesis examined is that considering semantic relatedness between two text documents may improve the performance of the text document clustering task. For our experimental evaluation we analyze the performance of several semantic relatedness measures when embedded in the proposed S-VSM and present results with respect to different experimental conditions, such as: (i) the datasets used, (ii) the underlying knowledge sources of the utilized measures, and (iii) the clustering algorithms employed. To the best of our knowledge, the current study is the first to systematically compare, analyze and evaluate the impact of semantic smoothing in text clustering based on "wisdom of linguists", e.g., WordNets, "wisdom of crowds", e.g., Wikipedia, and "wisdom of corpora", e.g., large text corpora represented with the traditional Bag of Words (BoW) model. Three semantic relatedness measures for text are considered: two knowledge-based (Omiotis, which uses WordNet, and WLM, which uses Wikipedia) and one corpus-based (PMI trained on a semantically tagged SemCor version). For the comparison of different experimental conditions we use the BCubed F-Measure evaluation metric, which satisfies all formal constraints of a good cluster quality measure. The experimental results show that clustering performance based on the S-VSM is better than that of the traditional VSM model and compares favorably against the standard GVSM kernel, which uses word co-occurrences to compute the latent similarities between document terms.
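
The idea of smoothing document similarity with word relatedness can be illustrated with a minimal sketch. It assumes a generic bilinear form k(d1, d2) = d1 S d2^T, where S is a term-by-term relatedness matrix. The abstract does not state the exact S-VSM formulation, so this is not necessarily the authors' kernel. The three-word vocabulary, the relatedness values in S, and the function names are all hypothetical and chosen only for illustration.

```python
import math

def smooth(doc_vec, S):
    """Map a bag-of-words vector d to its smoothed form d S."""
    n = len(doc_vec)
    return [sum(doc_vec[i] * S[i][j] for i in range(n)) for j in range(n)]

def smoothed_kernel(d1, d2, S):
    """Bilinear form d1 S d2^T (assumed here, not taken from the paper)."""
    return sum(a * b for a, b in zip(smooth(d1, S), d2))

def cosine_kernel(d1, d2, S):
    """Normalize the kernel so that every non-empty document has self-similarity 1."""
    denom = math.sqrt(smoothed_kernel(d1, d1, S) * smoothed_kernel(d2, d2, S))
    return smoothed_kernel(d1, d2, S) / denom if denom else 0.0

# Hypothetical vocabulary: ["car", "automobile", "banana"]
# Hypothetical relatedness scores; a real S would come from a measure
# such as Omiotis, WLM, or PMI.
S = [
    [1.0, 0.9, 0.0],
    [0.9, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
# With the identity matrix, the kernel reduces to the plain VSM cosine.
IDENTITY = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

doc_a = [1.0, 0.0, 0.0]  # mentions only "car"
doc_b = [0.0, 1.0, 0.0]  # mentions only "automobile"
doc_c = [0.0, 0.0, 1.0]  # mentions only "banana"

plain_ab = cosine_kernel(doc_a, doc_b, IDENTITY)
smooth_ab = cosine_kernel(doc_a, doc_b, S)
smooth_ac = cosine_kernel(doc_a, doc_c, S)

print(plain_ab, smooth_ab, smooth_ac)
```

In this toy setting the two documents that share no surface words but use related terms get zero similarity under the plain VSM and a high similarity once S is applied, while the unrelated document stays at zero. For the result to be a valid kernel, S should be symmetric positive semi-definite, which holds for the example matrix.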
