Abstract
The classical KNN algorithm requires a huge amount of computation when dealing with high-dimensional data or training sets containing many samples, which limits its practical application. An improved KNN algorithm based on clustering and density-based pruning is proposed. In the training stage, the training set is first pruned according to sample density; the pruned samples are then clustered into several clusters of relatively uniform density, each of which is converted into a hypersphere. In the testing stage, two methods are used. The first finds the k hyperspheres nearest to the test sample, takes all training samples inside these k hyperspheres as a new training set, and applies the classical KNN algorithm on this new set to obtain the test sample's class. The second finds the single hypersphere nearest to the test sample and assigns the test sample the class of that hypersphere. Experiments on eight UCI datasets show that, compared with classical KNN, the proposed algorithm performs well and is an effective classification method.
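The testing stage described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the hypersphere centers, the per-sphere member index lists, and the toy data are all assumptions made for the example; the first test method (KNN restricted to the k nearest hyperspheres) is shown, and the second method corresponds to calling it with `k_spheres=1` and voting over that single sphere.

```python
import numpy as np
from collections import Counter

def knn_predict(x, X, y, k):
    """Classical KNN: majority vote among the k nearest training samples."""
    d = np.linalg.norm(X - x, axis=1)
    idx = np.argsort(d)[:k]
    return Counter(y[idx]).most_common(1)[0][0]

def hypersphere_knn_predict(x, centers, members, X, y, k_spheres, k):
    """Improved test stage (method 1): restrict the KNN search to the
    training samples inside the k_spheres hyperspheres nearest to x."""
    d = np.linalg.norm(centers - x, axis=1)      # distance to each sphere center
    near = np.argsort(d)[:k_spheres]             # indices of nearest hyperspheres
    pool = np.concatenate([members[i] for i in near])
    return knn_predict(x, X[pool], y[pool], k)   # classical KNN on the reduced set

# Toy data (hypothetical): two hyperspheres, one per class.
X = np.array([[0.0, 0.0], [0.1, 0.1], [5.0, 5.0], [5.1, 5.1]])
y = np.array([0, 0, 1, 1])
centers = np.array([[0.05, 0.05], [5.05, 5.05]])  # hypersphere centers
members = [np.array([0, 1]), np.array([2, 3])]    # training-sample indices per sphere

print(hypersphere_knn_predict(np.array([0.2, 0.2]), centers, members, X, y,
                              k_spheres=1, k=2))  # → 0
```

Because only the samples inside the selected hyperspheres are compared against the test sample, the per-query cost drops from scanning the whole training set to scanning the sphere centers plus the members of the chosen spheres.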