Machine learning approaches for determining effective seeds for k-means algorithm.

Details

Author: Lertwachara, Kaveephong
Degree: Doctor
Year: 2003
Advisor: Cochran, James J.
Institution: Louisiana Tech University
Subject: Statistics; Business Administration, General; Computer Science; Artificial Intelligence
CBH: 3084539
Country: USA
Language: English
FileSize: 3728973
Pages: 116

Abstract

In this study, I investigate and experiment with two-stage clustering procedures (hybrid models) in simulated environments, where conditions such as collinearity and cluster structure are controlled, and in real-life problems, where they are not. The first hybrid model (NK) integrates a neural network (NN) with the k-means algorithm (KM): NN screens seeds and passes them to KM. The second hybrid (GK) uses a genetic algorithm (GA) in place of the neural network. Both NN and GA are used in their simplest possible forms.

In the simulated data sets, I investigate two properties: comparative clustering performance and the effects of five factors (scale, sample size, density, number of clusters, and number of variables) on the five clustering approaches (KM, NN, NK, GA, GK). Density, number of clusters, and dimension influence the clustering performance of all five approaches. KM, NK, and GK classify well when all clusters contain a similar number of observations, with NK and GK performing better than KM. NN performs well when one cluster contains more observations than any other cluster. The two hybrid models perform at least as well as KM even though the environments favor KM: the most crucial information, the true number of clusters, is provided to KM only. In addition, the cluster structures are simple: the clusters are well separated; the variances and cluster sizes are uniform; correlation between pairs of variables and collinearity are not significant; and the observations are normally distributed.

The real-life problems comprise three with a known natural cluster structure and one with an unknown natural cluster structure. Overall results indicate that GK performs better than KM, while NK performs worst among the five approaches. The two machine learning approaches generate better results than KM in an environment that does not favor KM.

GK is shown to be the best or among the best in the simulated environment and in real-life situations. Furthermore, GK detects firms with promising financial prospects, such as acquisition targets and firms with a "buy" recommendation, better than all other approaches.
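
The two-stage idea in the abstract (a first stage screens seeds, then KM runs from them) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the dissertation's method: the record does not describe how NN or GA screen seeds, so the first stage here is a plain stand-in that draws random candidate seed sets and keeps the one with the lowest sum of squared errors. The function names (`screen_seeds`, `kmeans_from_seeds`) and the toy data are hypothetical.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest(p, centers):
    """Index of the center closest to point p."""
    return min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))

def sse(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    return sum(sq_dist(p, centers[nearest(p, centers)]) for p in points)

def screen_seeds(points, k, n_candidates=20, rng=None):
    """Stage 1 (stand-in for the NN/GA screener, which is an assumption):
    draw random candidate seed sets and keep the one with the lowest SSE."""
    rng = rng or random.Random(0)
    candidates = [rng.sample(points, k) for _ in range(n_candidates)]
    return min(candidates, key=lambda seeds: sse(points, seeds))

def kmeans_from_seeds(points, seeds, max_iter=100):
    """Stage 2: standard k-means (Lloyd) iterations starting from the seeds."""
    centers = [tuple(s) for s in seeds]
    for _ in range(max_iter):
        groups = [[] for _ in centers]
        for p in points:
            groups[nearest(p, centers)].append(p)
        # Move each center to its group's mean; keep it if the group is empty.
        new_centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    labels = [nearest(p, centers) for p in points]
    return centers, labels

# Toy data (hypothetical): three well-separated, equally sized Gaussian clusters.
rng = random.Random(42)
true_means = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [(rng.gauss(mx, 1.0), rng.gauss(my, 1.0))
          for mx, my in true_means for _ in range(50)]

seeds = screen_seeds(points, k=3, rng=rng)           # stage 1: screen seeds
centers, labels = kmeans_from_seeds(points, seeds)   # stage 2: run KM
```

Keeping the two stages as separate functions mirrors the hybrid structure: any seed screener could replace `screen_seeds` without changing the k-means stage.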
