Abstract
Traditional text spectral clustering methods based on the vector space model tend to ignore the semantic relations between the contexts of a text. Representing text as a graph structure addresses this problem well. On this basis, this paper proposes a spectral clustering algorithm based on the maximum common subgraph, called SC-MCS. The algorithm computes the similarity between texts by solving for their maximum common subgraph and then clusters the texts. Experimental results show that, compared with traditional vector-space-based text spectral clustering, the algorithm achieves some improvement in both accuracy and recall.
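The similarity step can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not give the graph construction or the similarity formula, so the sketch assumes that each distinct word is a uniquely labeled node, that edges link adjacent words, and that similarity is the size of the maximum common subgraph divided by the size of the larger graph. Under the unique-label assumption the maximum common subgraph reduces to the shared nodes and shared edges. The function names (`text_to_graph`, `mcs_size`, `mcs_similarity`, `similarity_matrix`) are hypothetical.

```python
from itertools import combinations


def text_to_graph(tokens, window=1):
    """Build a graph from a token list (assumed construction).

    Nodes are the distinct words; an undirected edge links two different
    words when one follows the other within `window` positions.
    """
    nodes = set(tokens)
    edges = set()
    for i, w in enumerate(tokens):
        for v in tokens[i + 1 : i + 1 + window]:
            if v != w:
                edges.add(frozenset((w, v)))
    return nodes, edges


def graph_size(g):
    """Size of a graph, counted as nodes plus edges."""
    nodes, edges = g
    return len(nodes) + len(edges)


def mcs_size(g1, g2):
    """Size of the maximum common subgraph of two graphs.

    Assumes node labels are unique within each graph, so the common
    subgraph is simply the shared nodes together with the shared edges.
    """
    (n1, e1), (n2, e2) = g1, g2
    return len(n1 & n2) + len(e1 & e2)


def mcs_similarity(g1, g2):
    """Assumed similarity: |mcs(g1, g2)| / max(|g1|, |g2|), in [0, 1]."""
    denom = max(graph_size(g1), graph_size(g2))
    return mcs_size(g1, g2) / denom if denom else 0.0


def similarity_matrix(graphs):
    """Symmetric pairwise similarity matrix with ones on the diagonal."""
    n = len(graphs)
    sim = [[1.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        sim[i][j] = sim[j][i] = mcs_similarity(graphs[i], graphs[j])
    return sim
```

The resulting matrix would then serve as the affinity matrix for a spectral clustering routine. That step is omitted here.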