Abstract

Co-clustering algorithms cluster the rows and columns of a data matrix simultaneously. In this paper, we introduce a two-level subspace weighting scheme into co-clustering and propose TLWCC, a two-level subspace weighting co-clustering algorithm. TLWCC places a first level of weights on co-clusters, and a second level of weights on rows and columns. The three sets of weights (co-cluster, row, and column weights) are computed automatically during the clustering iterations, according to the distances between co-clusters (or rows, or columns) and their centers: a larger distance implies stronger noise, so a smaller weight is assigned, and vice versa. By assigning small weights to noisy information, TLWCC filters out the noise and improves the co-clustering result. We propose an iterative algorithm to optimize the model, and we carried out four experiments to study TLWCC. The first experiment investigates the properties of the three sets of weights; the second studies how the clustering result is influenced by the parameters; the third compares the clustering performance of TLWCC with three other algorithms; and the fourth examines the computational efficiency of the proposed algorithm.
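The inverse relationship between distance and weight described above can be sketched with an entropy-style softmin update, a common choice in subspace weighting algorithms (e.g., entropy-weighting k-means). The function name, the dispersion inputs, and the `gamma` regularization parameter below are illustrative assumptions; the abstract does not give TLWCC's exact update formula.

```python
import numpy as np

def softmin_weights(dispersions, gamma=1.0):
    """Map per-block dispersions (summed distances to the block center)
    to weights that sum to 1: a larger dispersion is treated as stronger
    noise and receives a smaller weight. This is a generic entropy-style
    softmin, not necessarily the exact TLWCC update.
    gamma > 0 controls how sharply weights concentrate on low-noise blocks.
    """
    d = np.asarray(dispersions, dtype=float)
    e = np.exp(-d / gamma)      # larger distance -> smaller unnormalized weight
    return e / e.sum()          # normalize so the weights sum to 1

# The block with the largest dispersion gets the smallest weight.
w = softmin_weights([0.1, 1.0, 5.0])
```

With a small `gamma`, nearly all weight concentrates on the least noisy block; with a large `gamma`, the weights approach uniform, so `gamma` trades off noise suppression against using all blocks.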