Research on Credit Missing-Data Imputation Based on the Compress-and-Proximity Method

English title: Research on Method of Credit Missing Data Imputation Based on Compress and Proximity
Authors: XIA Li-yu; HE Xiao-qun
Keywords: credit data; missing-data imputation; sample distance; random forest
Journal code: SSJS
Journal: Mathematics in Practice and Theory
Affiliation: Center for Applied Statistics, Renmin University of China
Publication date: 2017-04-23
Publisher: Mathematics in Practice and Theory (数学的实践与认识)
Year: 2017
Volume: 47
Funding: Major Project of the Key Research Institutes of Humanities and Social Sciences, Ministry of Education of China (15JJD910002)
Language: Chinese
File name: SSJS201708016
Page count: 7
Issue: 08
CN: 11-2018/O1
Pages: 149-155

Abstract

Massive credit datasets, with large sample sizes and high dimensionality, make missing-data imputation computationally expensive. To reduce this cost, this paper proposes the compress-and-proximity imputation method, which completes imputation in three stages. The first stage computes an inclusion probability for each sample from the missing proportions of samples and variables, then compresses the data by unequal-probability sampling. The second stage uses between-sample distances to select the samples nearest to the incomplete samples as the training set. The third stage builds a random forest model and imputes the missing values iteratively. A simulation study on the Australian credit scoring dataset and on credit scoring datasets from Chinese banks shows that the method substantially reduces the computational load while maintaining a reasonable level of imputation accuracy.
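The first two stages described in the abstract can be illustrated with a minimal, standard-library sketch. The abstract does not give the inclusion-probability formula or the distance measure, so everything here is an assumption made for illustration: the weighting in `inclusion_weights`, the Efraimidis-Spirakis key trick used for unequal-probability sampling, the root-mean-square distance over jointly observed variables, and all function names. The third stage (iterative random forest imputation fitted on the selected rows) is not shown.

```python
import random

MISSING = None  # marker for a missing cell in this sketch


def missing_rates(rows):
    """Return (per-row, per-column) proportions of missing cells."""
    n_cols = len(rows[0])
    row_rate = [sum(v is MISSING for v in r) / n_cols for r in rows]
    col_rate = [sum(r[j] is MISSING for r in rows) / len(rows) for j in range(n_cols)]
    return row_rate, col_rate


def inclusion_weights(rows):
    """Assumed weighting, not the paper's formula: favour rows with few
    missing cells that observe variables often missing elsewhere."""
    row_rate, col_rate = missing_rates(rows)
    weights = []
    for r, m in zip(rows, row_rate):
        observed = [col_rate[j] for j, v in enumerate(r) if v is not MISSING]
        bonus = sum(observed) / len(observed) if observed else 0.0
        weights.append((1.0 - m) * (1.0 + bonus))
    return weights


def compress(rows, n_keep, rng):
    """Stage 1: unequal-probability sampling without replacement, using
    Efraimidis-Spirakis keys u ** (1 / w); zero-weight rows are never drawn."""
    weights = inclusion_weights(rows)
    keyed = [(rng.random() ** (1.0 / w), i) for i, w in enumerate(weights) if w > 0]
    keyed.sort(reverse=True)
    return sorted(i for _, i in keyed[:n_keep])


def partial_distance(a, b):
    """Root mean squared difference over coordinates observed in both rows."""
    shared = [(x - y) ** 2 for x, y in zip(a, b) if x is not MISSING and y is not MISSING]
    if not shared:
        return float("inf")
    return (sum(shared) / len(shared)) ** 0.5


def select_training_set(rows, pool, k):
    """Stage 2: union of the k nearest pool rows for every incomplete row."""
    chosen = set()
    for t, target in enumerate(rows):
        if all(v is not MISSING for v in target):
            continue  # complete rows need no imputation
        ranked = []
        for i in pool:
            d = partial_distance(target, rows[i])
            if i != t and d != float("inf"):
                ranked.append((d, i))
        ranked.sort()
        chosen.update(i for _, i in ranked[:k])
    return sorted(chosen)


# Toy data: six samples, three numeric variables, None marks a missing cell.
rows = [
    [1.0, 2.0, 3.0],
    [1.1, None, 2.9],
    [5.0, 6.0, 7.0],
    [0.9, 2.1, 3.2],
    [5.2, 6.1, None],
    [None, None, None],
]
pool = compress(rows, n_keep=4, rng=random.Random(0))
train = select_training_set(rows, pool, k=1)
```

In this sketch a fully missing row gets weight zero and is never sampled, and the training set is always a subset of the compressed pool, so the model in the third stage would only ever see rows that survived compression.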

References

[1] Hand D J, Henley W E. Statistical classification methods in consumer credit scoring: A review[J]. Journal of the Royal Statistical Society, 1997, 160(3): 523-541.
[2] Fernandez-Delgado M, Cernadas E, Barro S, Amorim D. Do we need hundreds of classifiers to solve real world classification problems?[J]. Journal of Machine Learning Research, 2014, 15(1): 3133-3181.
[3] Subasi M M, Subasi E, Anthony M, Hammer P L. A new imputation method for incomplete binary data[J]. Discrete Applied Mathematics, 2011, 159(10): 1040-1047.
[4] Zhang S C. Nearest neighbor selection for iteratively kNN imputation[J]. The Journal of Systems and Software, 2012, 85(11): 2541-2552.
[5] Stekhoven D J, Bühlmann P. MissForest: non-parametric missing value imputation for mixed-type data[J]. Bioinformatics, 2012, 28(1): 112-118.
[6] Hapfelmeier A, Ulm K. Variable selection by Random Forests using data with missing values[J]. Computational Statistics and Data Analysis, 2014, 80(6): 129-139.
[7] Xiao Jin, Liu Dunhu, Gu Xin, Wang Shouyang. A dynamic classifier ensemble selection model for bank customer credit evaluation[J]. Journal of Management Sciences in China, 2015, 18(3): 114-126. (in Chinese)
