Safe Semi-supervised Classification Algorithm for High-Dimensional Data (面向高维数据的安全半监督分类算法)

English title: Safe Semi-supervised Classification Algorithm for High Dimensional Data
Authors: ZHAO Jian-Hua (赵建华); LIU Ning (刘宁)
Affiliations: College of Mathematics and Computer Application, Shangluo University (商洛学院数学与计算机应用学院); Faculty of Economics and Management, Shangluo University (商洛学院经济管理学院)
Keywords: high dimensional data; semi-supervised learning; stochastic subspace; ensemble technology; classification
Journal: Computer Systems & Applications (计算机系统应用)
Journal code: XTYY
Publication date: 2019-05-15
Year: 2019
Volume: 28
Issue: 05
Pages: 180-186
Page count: 7
CN: 11-2854/TP
Record ID: XTYY201905027
Funding: Natural Science Basic Research Plan of Shaanxi Province (2015JM6347); Research Project of Shangluo University (14SKY026); Science and Technology Innovation Team Building Project of Shangluo University (18SCX002); Key Discipline Construction Project of Shangluo University (discipline: Mathematics)
Language: Chinese

Abstract

In semi-supervised learning, randomly selecting unlabeled samples often degrades classifier performance and makes it unstable. Traditional semi-supervised learning algorithms also perform poorly on high-dimensional data that contain only a few labeled samples. To address both problems, this study explores the sample space and the feature space of the data and proposes S3LSE, a safe semi-supervised learning algorithm combining stochastic subspace technology and ensemble technology, for classifying high-dimensional data with very few labeled samples. First, S3LSE uses the random subspace technique to decompose the high-dimensional data set into B feature subsets and optimizes each one according to the implicit information among the samples, yielding B optimal feature subsets. Next, each optimal feature subset is sampled to form G sample subsets; within each sample subset, a safe sample-labeling method expands the labeled set and a classifier is trained, and the resulting G classifiers are combined into an ensemble. The B ensemble classifiers obtained from the B optimal feature subsets are then combined again to classify the high-dimensional data. Finally, experiments that simulate the semi-supervised learning process on high-dimensional data sets show that S3LSE performs well.
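
The two-level structure described in the abstract (B feature subsets, G sample subsets per feature subset, an inner and an outer ensemble) can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not specify the feature-subset optimization, the safe labeling rule, the base classifier, or the combination rule, so the sketch makes the following assumptions: the optimization step is omitted, a sample counts as "safely" labeled when a nearest-centroid learner and a 1-nearest-neighbour learner agree on it, the base classifier is nearest-centroid, and both ensemble levels use majority voting. All function and parameter names (`fit`, `predict`, `safe_expand`, `n_feats`, and so on) are hypothetical.

```python
import random
from collections import Counter

def sqdist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def centroids(X, y):
    """Nearest-centroid base learner: one mean vector per class."""
    sums, counts = {}, Counter(y)
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, v in enumerate(row):
            acc[j] += v
    return {c: [v / counts[c] for v in s] for c, s in sums.items()}

def predict_centroid(model, row):
    return min(model, key=lambda c: sqdist(model[c], row))

def predict_1nn(X, y, row):
    return min(zip(X, y), key=lambda pair: sqdist(pair[0], row))[1]

def safe_expand(XL, yL, XU):
    """Assumed 'safe' rule: pseudo-label an unlabeled row only when the
    nearest-centroid and 1-NN predictions agree; otherwise skip it."""
    model = centroids(XL, yL)
    X_new, y_new = list(XL), list(yL)
    for row in XU:
        guess = predict_centroid(model, row)
        if guess == predict_1nn(XL, yL, row):
            X_new.append(row)
            y_new.append(guess)
    return X_new, y_new

def project(X, feats):
    """Restrict every row to the chosen feature indices."""
    return [[row[j] for j in feats] for row in X]

def fit(XL, yL, XU, B=5, G=3, n_feats=4, seed=0):
    """Train B inner ensembles, one per random feature subset."""
    rng = random.Random(seed)
    ensemble = []
    for _ in range(B):
        # Random subspace step (the paper's subset optimization is omitted).
        feats = rng.sample(range(len(XL[0])), n_feats)
        PL, PU = project(XL, feats), project(XU, feats)
        members = []
        for _ in range(G):
            # Sample subset: half of the unlabeled rows, all labeled rows.
            sub = rng.sample(PU, max(1, len(PU) // 2))
            Xe, ye = safe_expand(PL, yL, sub)
            members.append(centroids(Xe, ye))
        ensemble.append((feats, members))
    return ensemble

def predict(ensemble, row):
    """Majority vote over G members, then over the B inner ensembles."""
    outer = []
    for feats, members in ensemble:
        p = [row[j] for j in feats]
        inner = Counter(predict_centroid(m, p) for m in members)
        outer.append(inner.most_common(1)[0][0])
    return Counter(outer).most_common(1)[0][0]
```

Calling `fit(XL, yL, XU)` with a handful of labeled rows `XL`, their labels `yL`, and a larger pool of unlabeled rows `XU` returns the nested ensemble, and `predict(ensemble, row)` classifies a new row. Skipping samples on which the two learners disagree is one simple way to limit the damage from wrong pseudo-labels; the paper's actual safety criterion may differ.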
