Abstract
To improve the accuracy of missing-data imputation, this paper proposes KKMOD, an algorithm that fills missing values using kernel K-Means clustering and outlier detection. The data set is first mapped into a high-dimensional space with a kernel method and then clustered. Within each cluster, every record with missing values is filled from the most similar record in the same cluster. Kernel K-Means is then used to detect outliers; the filled values of any detected outliers are removed, and those records are returned to the data set to be filled again. The algorithm iterates until no outliers are detected among the filled records. Experimental results show that KKMOD exploits within-cluster relationships and avoids interference between clusters, thereby improving imputation accuracy.
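The procedure summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the kernel, the initialization, the outlier criterion, or how a rejected fill is replaced. The sketch therefore assumes a Gaussian (RBF) kernel evaluated only on attributes observed in both records, deterministic farthest-point seeding, a fixed threshold on the squared feature-space distance to the cluster mean as the outlier test, and refilling a rejected record from the next most similar donor. All function names and parameters (`kkmod_sketch`, `gamma`, `threshold`, `max_rounds`) are hypothetical.

```python
import math

def rbf(x, y, gamma):
    """Gaussian kernel over attributes observed (not None) in both rows (assumption)."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y) if a is not None and b is not None)
    return math.exp(-gamma * d2)

def kernel_matrix(rows, gamma):
    return [[rbf(x, y, gamma) for y in rows] for x in rows]

def dist_to_cluster(K, i, members):
    """Squared feature-space distance from point i to the mean of `members`."""
    m = len(members)
    cross = sum(K[i][j] for j in members) / m
    within = sum(K[j][l] for j in members for l in members) / (m * m)
    return K[i][i] - 2.0 * cross + within

def kernel_kmeans(K, k, iters=20):
    """Kernel K-Means on a precomputed kernel matrix; returns one label per row."""
    n = len(K)
    seeds = [0]  # deterministic farthest-point seeding (assumption)
    while len(seeds) < k:
        rest = [i for i in range(n) if i not in seeds]
        seeds.append(min(rest, key=lambda i: max(K[i][s] for s in seeds)))
    labels = [max(range(k), key=lambda c: K[i][seeds[c]]) for i in range(n)]
    for _ in range(iters):
        groups = [[i for i in range(n) if labels[i] == c] for c in range(k)]
        live = [c for c in range(k) if groups[c]]
        new = [min(live, key=lambda c: dist_to_cluster(K, i, groups[c]))
               for i in range(n)]
        if new == labels:
            break
        labels = new
    return labels

def fill(rows, labels, K, banned):
    """Copy each missing value from the most similar same-cluster row that has it."""
    filled = [list(r) for r in rows]
    used = {}  # (row, attribute) -> index of the donor row
    for i, row in enumerate(rows):
        for a, v in enumerate(row):
            if v is not None:
                continue
            donors = [j for j in range(len(rows))
                      if j != i and labels[j] == labels[i]
                      and rows[j][a] is not None and (i, a, j) not in banned]
            if donors:
                best = max(donors, key=lambda j: K[i][j])
                filled[i][a] = rows[best][a]
                used[(i, a)] = best
    return filled, used

def kkmod_sketch(rows, k, gamma=0.5, threshold=1.0, max_rounds=5):
    """Cluster, fill within clusters, and refill rows that look like outliers."""
    K = kernel_matrix(rows, gamma)
    labels = kernel_kmeans(K, k)
    groups = {c: [i for i, l in enumerate(labels) if l == c] for c in set(labels)}
    banned = set()  # (row, attribute, donor) triples rejected by the outlier check
    for _ in range(max_rounds):
        filled, used = fill(rows, labels, K, banned)
        Kf = kernel_matrix(filled, gamma)
        # Outlier test (assumption): a filled row far from its own cluster mean.
        flagged = {i for (i, a) in used
                   if dist_to_cluster(Kf, i, groups[labels[i]]) > threshold}
        if not flagged:
            break
        banned |= {(i, a, j) for (i, a), j in used.items() if i in flagged}
    return filled, labels

# Two well-separated groups; None marks a missing value.
rows = [[0.0, 0.0], [0.2, 0.1], [0.05, None],
        [10.0, 10.0], [10.2, 9.9], [9.9, None]]
filled, labels = kkmod_sketch(rows, k=2)
print(labels)     # [0, 0, 0, 1, 1, 1]
print(filled[2])  # [0.05, 0.0]
print(filled[5])  # [9.9, 10.0]
```

In this toy example each incomplete record is filled only from donors in its own cluster, so the record near the origin never borrows a value from the cluster near (10, 10). A real implementation would likely tune `gamma` and the outlier criterion to the data rather than use the fixed values assumed here.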