Self-Training Method Combined with Density Peak Optimization Fuzzy Clustering

English title: Self-Training Algorithm Combined with Density Peak Optimization Fuzzy Clustering
Authors: LUO Yunsong; LÜ Jia
Author affiliations: College of Computer Science and Information Sciences, Chongqing Normal University; The Engineering & Technology Research Center of Digital Agriculture Service, Chongqing
Keywords: semi-supervised learning; self-training method; density peak optimization fuzzy clustering; clustering hypothesis
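Chinese journal title: CQSF
English journal title: Journal of Chongqing Normal University (Natural Science)
Institutions: College of Computer Science and Information Sciences, Chongqing Normal University; The Engineering & Technology Research Center of Digital Agriculture Service, Chongqing
Publication date: 2019-03-15 07:00
Publisher: Journal of Chongqing Normal University (Natural Science)
Year: 2019
Issue: v.36; No.166
Funding: Natural Science Foundation of Chongqing (No. cstc2014jcyjA40011); 2016 Humanities and Social Sciences Research Project of the Chongqing Municipal Education Commission (No. 16SKGH032); Science and Technology Project of the Chongqing Municipal Education Commission (No. KJ1600322); Research Project of Chongqing Normal University (No. YKC18025)
Language: Chinese
Page: CQSF201902016
Page count: 7
CN: 02
ISSN: 50-1165/N
Classification number: 101-107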

Abstract

[Purposes] To explore the distribution of the data set before iterative self-training begins, so that unlabeled samples that are both highly informative and high-confidence can be selected and added to the training set. This gives the initial classifier higher accuracy and improves the generalization of the self-training method. [Methods] Based on the clustering hypothesis, density peak clustering is first applied to the unlabeled sample set. After the cluster centers have been selected manually, they are used as the initial cluster centers for fuzzy clustering, which in turn screens out the useful unlabeled samples. [Findings] Using the density peak optimization fuzzy clustering algorithm, highly informative, high-confidence samples are selected and added to the training set, yielding a classifier with stronger generalization and higher classification accuracy. [Conclusions] The experimental results show that the improved self-training method can quickly discover the original spatial structure of the sample set and select useful unlabeled samples to add to the training set. Compared with self-training methods combined with other clustering algorithms, it achieves higher classification accuracy.
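The sample-selection pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation, and it rests on several assumptions. All function names, the cutoff distance `dc`, the fuzzifier `m`, the membership threshold of 0.9, and the toy two-blob data are hypothetical. In particular, the abstract says the cluster centers are chosen manually; the sketch substitutes an automatic rule (taking the points with the largest product of local density and separation distance) so that it can run unattended. The classifier training step of self-training is omitted.

```python
import math
import random


def density_peak_centers(X, n_centers, dc):
    """Return indices of candidate cluster centers via density peaks.

    Assumption: centers are picked automatically as the top rho * delta
    points; the abstract describes choosing them by hand instead.
    """
    n = len(X)
    d = [[math.dist(X[i], X[j]) for j in range(n)] for i in range(n)]
    # Local density with a Gaussian kernel of width dc (self excluded).
    rho = [sum(math.exp(-(d[i][j] / dc) ** 2) for j in range(n) if j != i)
           for i in range(n)]
    # delta: distance to the nearest point of higher density.
    delta = []
    for i in range(n):
        higher = [d[i][j] for j in range(n) if rho[j] > rho[i]]
        delta.append(min(higher) if higher else max(d[i]))
    order = sorted(range(n), key=lambda i: rho[i] * delta[i], reverse=True)
    return order[:n_centers]


def memberships(X, centers, m):
    """Fuzzy membership matrix: one row per sample, rows sum to 1."""
    U = []
    for x in X:
        inv = [max(math.dist(x, c), 1e-12) ** (-2.0 / (m - 1.0))
               for c in centers]
        total = sum(inv)
        U.append([v / total for v in inv])
    return U


def fuzzy_c_means(X, init_centers, m=2.0, n_iter=100, tol=1e-6):
    """Fuzzy c-means started from the given initial centers."""
    centers = [list(c) for c in init_centers]
    dim = len(X[0])
    for _ in range(n_iter):
        U = memberships(X, centers, m)
        new_centers = []
        for k in range(len(centers)):
            w = [u[k] ** m for u in U]
            total = sum(w)
            new_centers.append(
                [sum(wi * x[j] for wi, x in zip(w, X)) / total
                 for j in range(dim)])
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return memberships(X, centers, m), centers


# Toy unlabeled data: two well-separated 2-D blobs (illustrative only).
random.seed(0)
X_unlabeled = (
    [(random.gauss(0.0, 0.3), random.gauss(0.0, 0.3)) for _ in range(50)]
    + [(random.gauss(3.0, 0.3), random.gauss(3.0, 0.3)) for _ in range(50)])

center_idx = density_peak_centers(X_unlabeled, n_centers=2, dc=0.2)
U, centers = fuzzy_c_means(X_unlabeled, [X_unlabeled[i] for i in center_idx])

# Keep only samples whose strongest membership passes the threshold;
# these would be the candidates to add to the training set.
selected = [i for i, u in enumerate(U) if max(u) >= 0.9]
```

Seeding fuzzy clustering with density peaks, rather than random initial centers, is meant to reduce its sensitivity to initialization. The membership threshold then acts as the confidence filter on which unlabeled samples are passed to the self-training loop.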

