Data filling method using cluster analysis and outlier detection (利用聚类分析和离群点检测的数据填补方法)

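English title: Data filling using cluster analysis and outlier detection
Authors: MA Yong-jun (马永军); WANG Rui (汪睿); LI Ya-jun (李亚军); CHEN Hai-shan (陈海山)
English authors: MA Yong-jun; WANG Rui; LI Ya-jun; CHEN Hai-shan; College of Computer Science and Information Engineering, Tianjin University of Science and Technology; Research Center for Food Safety Management and Strategy, Tianjin University of Science and Technology
Keywords: kernel method; cluster analysis; missing data; data filling; outlier detection
English keywords: kernel method; clustering analysis; missing data; data filling; outlier detection
Chinese journal title: SJSJ
English journal title: Computer Engineering and Design
Affiliations: College of Computer Science and Information Engineering, Tianjin University of Science and Technology; Research Center for Food Safety Management and Strategy, Tianjin University of Science and Technology
Publication date: 2019-03-16
Publisher: Computer Engineering and Design (计算机工程与设计)
Year: 2019
Issue: v.40; No.387
Funding: Tianjin Science and Technology Plan Fund Projects (17KPXMSF00140, 17ZLZXZF00470); Tianjin Science and Technology Fund Project (KJCX-KFQ-CXY-2016-003)
Language: Chinese
Pages: SJSJ201903025
Page count: 5
CN: 03
ISSN: 11-1775/TP
Classification number: 151-154+168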

Abstract

To improve the accuracy of data filling, an algorithm (KKMOD) is proposed that fills missing data using kernel K-Means clustering and outlier detection. The kernel method maps the data set into a high-dimensional space, where clustering partitions it into clusters. Each missing value is filled from the most similar record in the same cluster. Kernel K-Means is then used to detect outliers; the filled values of any detected outliers are removed, and those records are returned to the data set to be filled again. The algorithm iterates until no outliers are detected among the filled data. Experimental results show that KKMOD takes full account of within-cluster relations and avoids interference between clusters, improving the accuracy of data filling.
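
The loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify the kernel, the similarity measure, or the outlier criterion, so everything below is an assumption made for illustration. Specifically, the sketch assumes an RBF kernel, a provisional column-mean fill before the first clustering, squared Euclidean distance on the observed attributes to pick the most similar complete record, and an outlier rule that flags a filled record whose feature-space distance to its cluster centre exceeds the cluster mean by `z` standard deviations. The names `rbf_kernel`, `center_distances`, `kernel_kmeans`, and `kkmod_fill` are hypothetical.

```python
import numpy as np

def rbf_kernel(X, gamma):
    """Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2) (assumed kernel)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-gamma * d2)

def center_distances(K, labels, k):
    """Squared feature-space distance from every point to every cluster centre:
    ||phi(x_i) - mu_c||^2 = K_ii - (2/m) sum_j K_ij + (1/m^2) sum_{j,l} K_jl."""
    dist = np.full((K.shape[0], k), np.inf)
    for c in range(k):
        members = labels == c
        m = members.sum()
        if m > 0:  # an empty cluster keeps distance +inf
            dist[:, c] = (np.diag(K) - 2.0 * K[:, members].sum(axis=1) / m
                          + K[np.ix_(members, members)].sum() / m ** 2)
    return dist

def kernel_kmeans(K, k, n_iter=50):
    """Kernel K-Means on a Gram matrix with deterministic farthest-first seeds."""
    diag = np.diag(K)
    seeds = [0]
    d = diag + diag[0] - 2.0 * K[:, 0]
    for _ in range(1, k):
        seeds.append(int(d.argmax()))
        d = np.minimum(d, diag + diag[seeds[-1]] - 2.0 * K[:, seeds[-1]])
    labels = (diag[:, None] + diag[seeds][None, :] - 2.0 * K[:, seeds]).argmin(axis=1)
    for _ in range(n_iter):
        new_labels = center_distances(K, labels, k).argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, center_distances(K, labels, k)

def kkmod_fill(X, k=2, gamma=1.0, max_rounds=10, z=2.0):
    """Fill NaNs from the most similar complete record in the same cluster,
    then refill any filled record that the assumed outlier rule flags."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    donors = np.flatnonzero(~missing.any(axis=1))          # complete records
    filled = np.where(missing, np.nanmean(X, axis=0), X)   # provisional mean fill
    used = {i: set() for i in np.flatnonzero(missing.any(axis=1))}
    todo = list(used)
    for _ in range(max_rounds):
        labels, _ = kernel_kmeans(rbf_kernel(filled, gamma), k)
        for i in todo:
            obs = ~missing[i]
            cands = [j for j in donors
                     if labels[j] == labels[i] and j not in used[i]]
            if not cands:
                continue  # no unused donor left: keep the current values
            gap = ((X[cands][:, obs] - X[i, obs]) ** 2).sum(axis=1)
            best = cands[int(gap.argmin())]
            filled[i, missing[i]] = X[best, missing[i]]
            used[i].add(best)  # assumption: a rejected donor is not reused
        labels, dist = kernel_kmeans(rbf_kernel(filled, gamma), k)
        own = dist[np.arange(len(X)), labels]
        todo = []
        for c in range(k):
            members = labels == c
            if members.any():
                limit = own[members].mean() + z * own[members].std()
                todo += [i for i in used if members[i] and own[i] > limit]
        if not todo:  # no filled record is flagged: stop iterating
            break
    return filled
```

Calling `kkmod_fill(data, k=2)` on a NumPy array that marks missing entries with `np.nan` returns a completed copy in which observed values are untouched. The `max_rounds` cap is a safeguard added for the sketch; the abstract only states that iteration continues until no filled record is detected as an outlier.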

References

[1] Li Z, Qin L, Cheng H, et al. TRIP: An interactive retrieving-inferring data imputation approach[J]. IEEE Transactions on Knowledge and Data Engineering, 2015, 27(9): 2550-2563.
[2] ZHANG Wen, JIANG Yipan, YIN Guangda, et al. Handling missing values in software effort data based on naive Bayes and EM algorithm[J]. Systems Engineering-Theory & Practice, 2017, 37(11): 2965-2974 (in Chinese). [张文,姜祎盼,殷广达,等.基于朴素贝叶斯和EM算法的软件工作量缺失数据处理方法[J].系统工程理论与实践,2017,37(11):2965-2974.]
[3] DING Chunrong, LI Longshu. Improved ROUSTIDA algorithm based on similarity relation vector[J]. Computer Engineering & Applications, 2014, 50(13): 133-136 (in Chinese). [丁春荣,李龙澍.基于相似关系向量的改进ROUSTIDA算法[J].计算机工程与应用,2014,50(13):133-136.]
[4] HAO Shengxuan, SONG Hong, ZHOU Xiaofeng. Predicting missing values with KNN based on the elimination of neighbor noise[J]. Computer Simulation, 2014, 31(7): 264-268 (in Chinese). [郝胜轩,宋宏,周晓锋.基于近邻噪声处理的KNN缺失数据填补算法[J].计算机仿真,2014,31(7):264-268.]
[5] Tang F, Ishwaran H. Random forest missing data algorithms[J]. Statistical Analysis and Data Mining: The ASA Data Science Journal, 2017.
[6] ZHANG Xiaoqin, CHENG Yuying. Imputation of missing values for compositional data based on random forest[J]. Chinese J Appl Probab Statist, 2017, 33(1): 102-110 (in Chinese). [张晓琴,程誉莹.基于随机森林模型的成分数据缺失值填补法[J].应用概率统计,2017,33(1):102-110.]
[7] Faußer S, Schwenker F. Semi-supervised clustering of large data sets with kernel methods[J]. Pattern Recognition Letters, 2014, 37(1): 78-84.
[8] Wang T, Zhao D, Tian S. An overview of kernel alignment and its applications[J]. Artificial Intelligence Review, 2015, 43(2): 179-192.
[9] Piciarelli C, Micheloni C, Foresti G L. Kernel-based clustering[J]. Electronics Letters, 2013, 49(2): 113-U7.
[10] WANG Peiyan, CAI Dongfeng. Distance-based kernel evaluation measure[J]. Computer Science, 2014, 41(2): 72-75 (in Chinese). [王裴岩,蔡东风.一种基于核距离的核函数度量方法[J].计算机科学,2014,41(2):72-75.]
[11] Kim S, Cho N W, Lee Y J, et al. Application of density-based outlier detection to database activity monitoring[J]. Information Systems Frontiers, 2013, 15(1): 55-65.
