Research and Implementation of a Semi-Supervised Self-Training Classification Model
Abstract
Semi-supervised learning is a learning method proposed in recent years; by learning objective it can be broadly divided into semi-supervised classification and semi-supervised clustering. Its central idea is to improve learning performance by combining a large amount of unlabeled data with a small labeled training set.
     This thesis focuses on semi-supervised classification, studying and analyzing in depth the self-training algorithm, a representative semi-supervised classification method. Because the labeled training set is small in the initial stage, the classifier trained on it performs poorly; we therefore improve the self-training model by introducing a data editing technique based on the nearest-neighbor rule, which tries to identify mislabeled examples introduced during training and classification and thus purify the training set. Applied in every iteration of training, the technique identifies and removes noise, purifies the training set, and improves classification accuracy. The experimental data sets are drawn at random from the UCI machine learning repository. The results show that the model with data editing achieves higher classification accuracy than the original model to varying degrees; averaged over the experiments, classification accuracy improves by 6.705%.
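     To make the procedure concrete, the following is a minimal sketch of self-training with nearest-neighbor-rule data editing, assuming scikit-learn-style estimators, NumPy arrays with integer-encoded class labels, and a base learner that exposes predict_proba; the function names, batch size, and neighborhood size are illustrative choices, not values taken from the thesis.

    import numpy as np
    from sklearn.base import clone
    from sklearn.neighbors import NearestNeighbors

    def edit_with_nn_rule(X, y, k=3):
        """Nearest-neighbor editing: drop every point whose label disagrees
        with the majority label of its k nearest neighbors (self excluded).
        Assumes y holds non-negative integer class labels."""
        idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)[1]
        neighbor_labels = y[idx[:, 1:]]  # column 0 is the point itself
        majority = np.array([np.bincount(row).argmax() for row in neighbor_labels])
        keep = majority == y
        return X[keep], y[keep]

    def self_train_with_editing(base_clf, X_l, y_l, X_u, max_iter=10, batch=10, k=3):
        """Self-training: label the most confident unlabeled points, add them
        to the training set, then purify the enlarged set with the NN rule."""
        clf = clone(base_clf)
        X_u = X_u.copy()
        for _ in range(max_iter):
            clf.fit(X_l, y_l)
            if len(X_u) == 0:
                break
            proba = clf.predict_proba(X_u)
            pick = np.argsort(proba.max(axis=1))[-batch:]   # most confident points
            y_new = clf.classes_[proba[pick].argmax(axis=1)]
            X_l = np.vstack([X_l, X_u[pick]])
            y_l = np.concatenate([y_l, y_new])
            X_u = np.delete(X_u, pick, axis=0)
            X_l, y_l = edit_with_nn_rule(X_l, y_l, k=k)     # remove suspected noise
        return clf.fit(X_l, y_l)

     Editing the enlarged training set after every batch keeps a single mislabeled point from reinforcing itself across iterations, which is the failure mode of plain self-training that the data editing step is meant to counter.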
     This thesis also addresses the limited classification ability of the Tri-Training model and improves it accordingly. We use a model in which different classifiers cooperate with one another and label unlabeled data by voting, improving on the traditional Tri-Training model proposed by Zhou et al., in which classifiers of the same type cooperate and vote. Alongside the cooperation of heterogeneous classifiers, and as in the improved self-training model, the nearest-neighbor-rule data editing technique is applied to reduce noise and purify the training set. The experimental data sets are again drawn at random from the UCI machine learning repository. The experiments show that the improved model raises classification accuracy over the original model to varying degrees.
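     In the same spirit, below is a minimal sketch of the voting step with three heterogeneous base learners; the particular trio of classifiers and the rule shown (accept an unlabeled point when at least two of the three learners agree) are illustrative assumptions, and the accepted points would then pass through the same edit_with_nn_rule purification step sketched above before being merged into the training set.

    import numpy as np
    from sklearn.naive_bayes import GaussianNB
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier

    def vote_label(classifiers, X_l, y_l, X_u):
        """Fit each classifier on the labeled data, then accept an unlabeled
        point only when at least two of the three agree on its label."""
        preds = np.stack([clf.fit(X_l, y_l).predict(X_u) for clf in classifiers])
        agree = ((preds[0] == preds[1]) |
                 (preds[0] == preds[2]) |
                 (preds[1] == preds[2]))
        # the label shared by at least two of the three classifiers
        voted = np.where(preds[0] == preds[1], preds[0],
                         np.where(preds[0] == preds[2], preds[0], preds[1]))
        return X_u[agree], voted[agree]

    # Three classifiers of different types, unlike the homogeneous
    # trio used in the original Tri-Training model:
    classifiers = [GaussianNB(), DecisionTreeClassifier(), KNeighborsClassifier()]

     Requiring agreement between learners of different types filters out the systematic errors that a single biased learner would repeatedly make, which is the intuition behind replacing Tri-Training's homogeneous ensemble with a heterogeneous one.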
References
[1]Hartley H O,Rao J N K.Classification and estimation in analysis of variance problems[J].Review of the International Statistical Institute,1968,36(3):141-147.
    [2]Dempster A P,Laird N M,Rubin D B.Maximum Likelihood from Incomplete Data via the EM Algorithm[J].Journal of the Royal Statistical Society,Series B,1977,39(1):1-38.
    [3]A Blum and T Mitchell.Combining labeled and unlabeled data with co-training[C].Proceedings of 11th Annual Conference on Computational Learning Theory.Madison,WI,1998:92-100.
    [4]S Goldman,Y Zhou.Enhancing supervised learning with unlabeled data[C].Proceedings of the 17th International Conference on Machine Learning.San Francisco,CA,2000:327-334.
    [5]Zhou Zhi Hua,Li Ming.Tri-training:Exploiting unlabeled data using three classifiers[J].IEEE Transactions on Knowledge and Data Engineering,2005,17(11):1529-1541.
    [6]Zhou Zhihua,Wang Jue.Machine Learning and Its Applications[M].Beijing:Tsinghua University Press,2007.
    [7]Salton G,Yang C S.On the specification of term values in automatic indexing[J].Journal of Documentation,1973,29(4):351-372.
    [8]Jones K S,Walker S,Robertson S E.A probabilistic model of information retrieval:development and comparative experiments[J].Information Processing and Management,2000,36(6):779-808.
    [9]Qin Guofeng,Li Qiyan.Knowledge acquisition and discovery based on data mining[J].Computer Engineering,2003,29(21):20-22.
    [10]Liu T,Liu S P,Chen Z,et al.An evaluation on feature selection for text clustering[C].Twentieth International Conference on Machine Learning,Washington DC,USA,2003:488-495.
    [11]Yang Y M,Pedersen J O.A comparative study on feature selection in text categorization[C].14th International Conference on Machine Learning ICML97,Nashville,USA,1997:412-420.
    [12]Rogati M,Yang Y M.High performing feature selection for text classification[C].Eleventh International Conference on Information and Knowledge Management,Virginia,USA,2002:59-61.
    [13]Feng Shicong,Shan Songwei,Gong Bihong,et al.Research on the "Tianwang" directory navigation service[J].Journal of Computer Research and Development,2004,41(4):653-659.
    [14]Huang Jingxuan,Wu Lide.A document classification system based on the vector space model[J].Pattern Recognition and Artificial Intelligence,1998,11(2):147-153.
    [15]Cover T M,Hart P E.Nearest neighbor pattern classification[J].IEEE Transactions on Information Theory,1967,13(1):21-27.
    [16]Drucker H,Wu D H,Vapnik V N.Support vector machines for spam categorization[J].IEEE Transactions on Neural Networks,1999,10(5):1048-1054.
    [17]B Shahshahani,D Landgrebe.The effect of unlabeled samples in reducing the small sample size problem and mitigating the Hughes phenomenon[J].IEEE Transactions on Geoscience and Remote Sensing,1994,32(5):1087-1095.
    [18]D J Miller,H S Uyar.A mixture of experts classifier with learning based on both labeled and unlabelled data[C].Advances in Neural Information Processing Systems 9,Cambridge,MA:MIT Press,1997:571-577.
    [19]X Zhu,Z Ghahramani,J Lafferty.Semi-supervised learning using Gaussian fields and harmonic functions[C].Proceedings of the 20th International Conference on Machine Learning(ICML' 03),Washington DC,2003:912-919.
    [20]D Zhou,O Bousquet,T N Lal,et al.Learning with local and global consistency[C].Advances in Neural Information Processing Systems 16,Cambridge,MA:MIT Press,2004:321-328.
    [21]M Belkin,P Niyogi,V Sindhwani.On manifold regularization[C].Proceedings of the 10th International Workshop on Artificial Intelligence and Statistics(AISTATS'05),Savannah Hotel,Barbados,2005:17-24.
    [22]Jiang Yuan,Zhou Zhi Hua.Editing training data for KNN classifiers with neural network ensemble[C].Proceedings of the 1st International Symposium on Neural Networks.Dalian,China,2004:356-361.
    [23]Deng Chao,Guo Maozu.A semi-supervised clustering algorithm based on Tri-training and data editing[J].Journal of Software,2008,19(3):663-673.
    [24]Li Ming,Zhou Zhi Hua.SETRED:Self-Training with editing[C].Proceedings of the Advances in Knowledge Discovery and Data Mining(PAKDD2005) LNAI 3518.Heidelberg,2005:611-621.
    [25]I Cohen,F G Cozman,N Sebe,et al.Semi-supervised learning of classifiers:Theory,algorithm,and their application to human-computer interaction[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2004,26(12):1553-1567.
    [26]Zhou Zhihua.Preface to the special issue on semi-supervised learning[J].Journal of Software,2008,19(11):2789-2790.
    [27]Deng Chao,Guo Maozu.A Tri-training algorithm based on an adaptive data editing strategy[J].Chinese Journal of Computers,2007,30(8):1213-1226.
    [28]Li Kunlun,Zhang Wei,Dai Yunna.Semi-supervised SVM based on Tri-Training[J].Computer Engineering and Applications,2009,45(22):103-106.
