A probabilistic framework for optimizing projected clusters with categorical attributes
Details
  • Author: LiFei Chen
  • Keywords: projective clustering; projected cluster; categorical data; probabilistic framework; kernel density estimation; attribute weighting; 072104
  • Journal: SCIENCE CHINA Information Sciences
  • Year: 2015
  • Issue date: July 2015
  • Volume: 58
  • Issue: 7
  • Pages: 1-15
  • Full-text size: 535 KB
  • References:
    1. Aggarwal C C, Procopiuc C, Wolf J L, et al. Fast algorithms for projected clustering. ACM SIGMOD Rec, 1999, 28: 61–72
    2. Moise G, Sander J, Ester M. Robust projected clustering. Knowl Inf Syst, 2008, 14: 273–298
    3. Chen L, Jiang Q, Wang S. Model-based method for projective clustering. IEEE Trans Knowl Data Eng, 2012, 24: 1291–1305
    4. Huang J Z, Ng M K, Rong H, et al. Automated variable weighting in k-means type clustering. IEEE Trans Patt Anal Mach Intell, 2005, 27: 657–668
    5. Poon L, Zhang N, Chen T, et al. Variable selection in model-based clustering: to do or to facilitate. In: Proceedings of the 27th International Conference on Machine Learning, Haifa, 2010. 887–894
    6. Light R J, Margolin B H. An analysis of variance for categorical data. J Am Stat Assoc, 1971, 66: 534–544
    7. San O M, Huynh V N, Nakamori Y. An alternative extension of the k-means algorithm for clustering categorical data. Int J Appl Math Comput Sci, 2004, 14: 241–247
    8. Huang Z. Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Min Knowl Discov, 1998, 2: 283–304
    9. Chan E Y, Ching W K, Ng M K, et al. An optimization algorithm for clustering using weighted dissimilarity measures. Patt Recogn, 2004, 37: 943–952
    10. Bai L, Liang J, Dang C, et al. A novel attribute weighting algorithm for clustering high-dimensional categorical data. Patt Recogn, 2011, 44: 2843–2861
    11. Xiong T, Wang S, Mayers A, et al. DHCC: divisive hierarchical clustering of categorical data. Data Min Knowl Discov, 2012, 24: 103–135
    12. Chen L, Wang S. Central clustering of categorical data with automated feature weighting. In: Proceedings of the 23rd International Joint Conference on Artificial Intelligence, Beijing, 2013. 1260–1266
    13. Cao F, Liang J, Li D, et al. A weighting k-modes algorithm for subspace clustering of categorical data. Neurocomputing, 2013, 108: 23–30
    14. Boriah S, Chandola V, Kumar V. Similarity measures for categorical data: a comparative evaluation. In: Proceedings of the 8th SIAM International Conference on Data Mining, Atlanta, 2008. 243–254
    15. Parsons L, Haque E, Liu H. Subspace clustering for high dimensional data: a review. ACM SIGKDD Explor Newslett, 2004, 6: 90–105
    16. Gan G, Wu J. Subspace clustering for high dimensional categorical data. ACM SIGKDD Explor Newslett, 2004, 6: 87–94
    17. Bai L, Liang J, Dang C, et al. The impact of cluster representatives on the convergence of the k-modes type clustering. IEEE Trans Patt Anal Mach Intell, 2013, 35: 1509–1522
    18. Sen P K. Gini diversity index, Hamming distance and curse of dimensionality. Metron Int J Stat, 2005, LXIII: 329–349
    19. Tao J, Chung F, Wang S. A kernel learning framework for domain adaptation learning. Sci China Inf Sci, 2012, 55: 1983–2007
    20. Ouyang D, Li Q, Racine J. Cross-validation and the estimation of probability distributions with categorical data. Nonparametr Stat, 2006, 18: 69–100
    21. Li Q, Racine J S. Nonparametric Econometrics: Theory and Practice. Princeton: Princeton University Press, 2007
    22. Aitchison J, Aitken C. Multivariate binary discrimination by the kernel method. Biometrika, 1976, 63: 413–420
    23. Hofmann T, Schölkopf B, Smola A J. Kernel methods in machine learning. Ann Stat, 2008, 36: 1171–1220
    24. Zhou K, Fu C, Yang S. Fuzziness parameter selection in fuzzy c-means: the perspective of cluster validation. Sci China Inf Sci, 2014, 57: 112206
    25. Jain A K, Murty M N, Flynn P J. Data clustering: a review. ACM Comput Surv, 1999, 31: 264–323
    26. Li T, Ma S, Ogihara M. Entropy-based criterion in categorical clustering. In: Proceedings of the 21st International Conference on Machine Learning, Alberta, 2004. 536–543
    27. Wang K, Yan X, Chen L. Geometric double-entity model for recognizing far-near relations of clusters. Sci China Inf Sci, 2011, 54: 2040–2050
  • Author affiliation: LiFei Chen (1)

    1. School of Mathematics and Computer Science, Fujian Normal University, Fuzhou, 350117, China
  • Journal category: Computer Science
  • Journal subjects: Chinese Library of Science;
    Information Systems and Communication Service
  • Publisher: Science China Press, co-published with Springer
  • ISSN: 1869-1919
Abstract
The ability to discover projected clusters in high-dimensional data is essential for many machine learning applications. Projective clustering of categorical data remains a challenge because of the difficulty of learning adaptive weights for categorical attributes in coordination with cluster optimization. In this paper, a probability-based learning framework is proposed that allows both the attribute weights and the center-based clusters to be optimized through kernel density estimation on categorical attributes. A novel algorithm for projective clustering of categorical data is then derived, based on the new learning approach to the kernel bandwidth selection problem. We show that the attribute weight is substantially connected to the kernel bandwidth, while the optimized cluster center corresponds to the normalized frequency estimator on the categorical attributes. Experimental results on synthetic and real-world data show that the proposed method significantly outperforms state-of-the-art algorithms.
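The abstract ties attribute weights to kernel bandwidths and cluster centers to normalized frequency estimators. A minimal sketch of the Aitchison–Aitken kernel estimator for a single categorical attribute (reference 22 above) illustrates that connection; the function name and encoding of categories as integers 0..c-1 are illustrative assumptions, not taken from the paper:

```python
from collections import Counter

def aitchison_aitken_kde(values, num_categories, bandwidth):
    """Smoothed category probabilities via the Aitchison-Aitken kernel.

    K(x, x_i) = 1 - lam        if x == x_i,
                lam / (c - 1)  otherwise,
    with bandwidth lam in [0, (c - 1) / c].  lam = 0 recovers the plain
    normalized frequency estimator; lam = (c - 1) / c fully smooths the
    estimate toward the uniform distribution over the c categories.
    """
    n = len(values)
    counts = Counter(values)
    c, lam = num_categories, bandwidth
    return {
        x: (counts.get(x, 0) * (1.0 - lam)
            + (n - counts.get(x, 0)) * lam / (c - 1)) / n
        for x in range(c)
    }

sample = [0, 0, 0, 1, 2]                      # toy attribute, c = 3 categories
print(aitchison_aitken_kde(sample, 3, 0.0))   # lam = 0: frequencies 0.6, 0.2, 0.2
print(aitchison_aitken_kde(sample, 3, 2/3))   # lam = 2/3: uniform, each 1/3
```

In the framework described above, a small bandwidth on an attribute marks it as relevant to the cluster (the density concentrates on the modal category), while a bandwidth near its upper bound flattens the attribute's contribution, which is the mechanism that lets bandwidth play the role of an attribute weight.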
