A data sampling method based on double decision tree (基于双决策树的数据采样方法)

English title: A data sampling method based on double decision tree
Authors: CHEN Li (陈力); FEI Hong-xiao (费洪晓); DING Hai-lun (丁海伦); CHENG Lin (成琳); ZHAI Ji-yu (翟纪宇)
Affiliations: School of Geosciences and Info-Physics, Central South University; School of Software, Central South University
Keywords: decision tree; data sampling; machine learning
Journal code: JSJK
Journal: Computer Engineering & Science (计算机工程与科学)
Publication date: 2019-01-15
Publisher: Computer Engineering & Science
Year: 2019
Volume/number: v.41; No.289
Issue: 01
Funding: National Natural Science Foundation of China (61602525); Central South University 2017 Undergraduate Free Exploration Project (201710533267, ZY20170769)
Language: Chinese
Record ID: JSJK201901017
Pages: 134-139
Page count: 6
CN: 43-1258/TP

Abstract

In data mining, a basic assumption is that training-set and test-set samples follow the same data distribution. As data volumes grow, however, finding representative data in massive data sets becomes particularly difficult. A study of existing data selection methods shows that traditional methods such as simple random sampling and progressive sampling are not integrated with the data mining tool, so their sampling results are subject to chance and uncertainty. The sampled data can hardly guarantee the basic assumption of data mining, which in turn increases the generalization error of the final model. To address class imbalance during data sampling, we propose a structured data sampling method based on double decision trees. First, a decision tree is generated with the C4.5 algorithm and used to select suitable data and data collection points from the data source. A second decision tree then evaluates the quality of the selected data set, with the aim of achieving efficient, high-quality sampling. Experiments show that, compared with simple random sampling, models trained on the newly sampled data achieve noticeably higher accuracy.
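The two-tree idea in the abstract can be illustrated with a small, self-contained Python sketch. It is not the authors' implementation: the abstract does not say how the first tree selects data, so the sketch assumes that the leaves of the first tree act as strata and that an equal number of rows is drawn from each leaf. The trees use the gain ratio criterion associated with C4.5 but handle only categorical features and do no pruning. All function names (`build_tree`, `leaf_path`, `double_tree_sample`) and the parameters `seed_size` and `per_leaf` are hypothetical.

```python
import math
import random
from collections import Counter, defaultdict

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(rows, labels, f):
    """Gain ratio (the C4.5 split criterion) of categorical feature index f."""
    groups = defaultdict(list)
    for r, y in zip(rows, labels):
        groups[r[f]].append(y)
    n = len(labels)
    cond = sum(len(g) / n * entropy(g) for g in groups.values())
    split_info = entropy([r[f] for r in rows])
    return (entropy(labels) - cond) / split_info if split_info > 0 else 0.0

def build_tree(rows, labels, depth=3):
    """Simplified tree (no pruning, categorical features only), as nested dicts."""
    majority = Counter(labels).most_common(1)[0][0]
    if depth == 0 or len(set(labels)) == 1:
        return {"leaf": majority}
    best = max(range(len(rows[0])), key=lambda f: gain_ratio(rows, labels, f))
    if gain_ratio(rows, labels, best) <= 0:
        return {"leaf": majority}
    parts = defaultdict(lambda: ([], []))
    for r, y in zip(rows, labels):
        parts[r[best]][0].append(r)
        parts[r[best]][1].append(y)
    children = {v: build_tree(rs, ys, depth - 1) for v, (rs, ys) in parts.items()}
    return {"feature": best, "default": majority, "children": children}

def leaf_path(tree, row):
    """Return (path, label); path identifies the leaf (stratum) the row reaches."""
    path = []
    while "leaf" not in tree:
        value = row[tree["feature"]]
        if value not in tree["children"]:  # value unseen during training
            return tuple(path), tree["default"]
        path.append((tree["feature"], value))
        tree = tree["children"][value]
    return tuple(path), tree["leaf"]

def accuracy(tree, rows, labels):
    hits = sum(leaf_path(tree, r)[1] == y for r, y in zip(rows, labels))
    return hits / len(rows) if rows else 0.0

def double_tree_sample(rows, labels, seed_size, per_leaf, rng):
    # Tree 1 (selection): fit on a small random seed; its leaves become strata.
    seed_idx = rng.sample(range(len(rows)), seed_size)
    select_tree = build_tree([rows[i] for i in seed_idx], [labels[i] for i in seed_idx])
    strata = defaultdict(list)
    for i, r in enumerate(rows):
        strata[leaf_path(select_tree, r)[0]].append(i)
    # Assumption: draw the same number of rows from every leaf.
    chosen = []
    for members in strata.values():
        chosen += rng.sample(members, min(per_leaf, len(members)))
    # Tree 2 (evaluation): fit on the chosen sample, score it on the remaining rows.
    eval_tree = build_tree([rows[i] for i in chosen], [labels[i] for i in chosen])
    chosen_set = set(chosen)
    rest = [i for i in range(len(rows)) if i not in chosen_set]
    score = accuracy(eval_tree, [rows[i] for i in rest], [labels[i] for i in rest])
    return chosen, score

# Synthetic demo data (hypothetical): two categorical features, binary label.
rng = random.Random(0)
rows = [(rng.choice("ab"), rng.choice("xyz")) for _ in range(300)]
labels = [int(a == "a" and b != "z") for a, b in rows]
chosen, score = double_tree_sample(rows, labels, seed_size=30, per_leaf=10, rng=rng)
print(len(chosen), round(score, 3))
```

In this sketch the returned score plays the role of the quality check: a low score would suggest re-drawing the sample or changing `per_leaf`. Whether the paper iterates in this way is not stated in the abstract.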

