Random Sample Partition Data Model and Related Technologies for Big Data Analysis
  • Title (Chinese): 大数据随机样本划分模型及相关分析计算技术
  • Authors: Huang Zhexue (黄哲学); He Yulin (何玉林); Wei Chenghao (魏丞昊); Zhang Xiaoliang (张晓亮)
  • Affiliations: Big Data Institute, College of Computer Science & Software Engineering, Shenzhen University; National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University
  • Keywords: big data; random sample partition; asymptotic ensemble learning; artificial intelligence
  • Journal: Journal of Data Acquisition and Processing (数据采集与处理)
  • Journal code: SJCJ
  • Publication date: 2019-05-15
  • Year: 2019
  • Volume/Issue: Vol. 34, Issue 3 (No. 155 overall)
  • Funding: National Key Research and Development Program of China (2017YFC0822604-2); China Postdoctoral Science Foundation (2016T90799); Shenzhen University Research Start-up Fund for Newly Recruited Teachers, 2018 (2018060); National-Level Major Cultivation Fund for Regular Higher Education Institutions of Guangdong Province (2014GKXM054)
  • Language: Chinese
  • Article ID: SJCJ201903001
  • Pages: 5-17 (13 pages)
  • CN: 32-1367/TN
Abstract
The random sample partition (RSP) model is a new data model for managing and analyzing big data. It represents a big data set as a collection of RSP data block files distributed across the nodes of a computing cluster. The RSP generation operation keeps the probability distribution of each RSP data block statistically consistent with the distribution of the whole data set, so each RSP block is a random sample of the big data and can be used to estimate its statistical properties or to build classification and regression models. With the RSP model, a big data analysis task can be carried out by analyzing RSP data blocks instead of computing over the entire data set, which greatly reduces the amount of computation, lowers the demand for computing resources, and improves the computing capacity and scalability of the cluster system. This paper first presents the definition, theoretical foundation, and generation method of the RSP model. It then introduces the Alpha computing framework for asymptotic ensemble learning on RSP data blocks. Next, it discusses the big data analysis technologies based on the RSP model and the Alpha framework, including data exploration and cleaning, probability density function estimation, supervised subspace learning, semi-supervised ensemble learning, clustering ensembles, and outlier detection. Finally, it discusses the innovations of the RSP model in divide-and-conquer big data analysis and in sampling methods, as well as the advantages of the RSP model and the Alpha framework for large-scale data analysis.
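To make the RSP idea concrete, the following is a minimal, in-memory sketch in Python/NumPy of how a data set can be cut into RSP-style blocks and how a single block can stand in for the whole data set when estimating simple statistics. It only illustrates the statistical idea stated in the abstract: the function name generate_rsp_blocks and the synthetic data are illustrative assumptions, and the actual RSP model generates and stores block files distributedly on a cluster file system rather than shuffling data in memory.

```python
import numpy as np

def generate_rsp_blocks(data, num_blocks, seed=0):
    """Toy RSP generation: randomly permute all records, then split them
    into blocks of (nearly) equal size. After the global shuffle, each
    block is a simple random sample of the full data set, so its empirical
    distribution is statistically consistent with that of the whole data."""
    rng = np.random.default_rng(seed)
    shuffled = data[rng.permutation(len(data))]   # global random shuffle
    return np.array_split(shuffled, num_blocks)   # the RSP data blocks

if __name__ == "__main__":
    # Synthetic "big" data set: one million records, two features.
    big_data = np.random.default_rng(1).normal(loc=[5.0, -2.0],
                                               scale=[1.0, 3.0],
                                               size=(1_000_000, 2))
    blocks = generate_rsp_blocks(big_data, num_blocks=100)

    # A single block already estimates the statistics of the whole data
    # set, so there is no need to scan all records.
    one_block = blocks[0]
    print("full-data mean :", big_data.mean(axis=0))
    print("one-block mean :", one_block.mean(axis=0))
```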
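The Alpha framework described in the abstract builds models on RSP blocks a few at a time and combines them into an ensemble whose estimate stabilizes as more blocks are consumed. The sketch below is a hypothetical, sequential simplification of that asymptotic idea using scikit-learn decision trees: one base model is trained per block, and training stops once the averaged prediction no longer changes beyond a tolerance. Block scheduling, distributed execution, and the actual stopping criteria of the published framework are more involved.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def asymptotic_ensemble(labeled_blocks, x_eval, tol=1e-2):
    """Toy asymptotic ensemble learning: fit one base model per RSP block,
    average the models' predictions on a fixed evaluation grid, and stop
    as soon as adding another block changes the ensemble by less than tol."""
    models, prev = [], None
    for X, y in labeled_blocks:                       # one block at a time
        models.append(DecisionTreeRegressor(max_depth=5).fit(X, y))
        pred = np.mean([m.predict(x_eval) for m in models], axis=0)
        if prev is not None and np.max(np.abs(pred - prev)) < tol:
            break                                     # estimate has stabilized
        prev = pred
    return models

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    # Synthetic regression data, pre-split into RSP-style random blocks.
    X_all = rng.uniform(0.0, 1.0, size=(200_000, 1))
    y_all = np.sin(2 * np.pi * X_all[:, 0]) + rng.normal(0.0, 0.1, 200_000)
    idx = rng.permutation(len(X_all))
    blocks = [(X_all[b], y_all[b]) for b in np.array_split(idx, 50)]

    x_eval = np.linspace(0.0, 1.0, 200).reshape(-1, 1)
    ensemble = asymptotic_ensemble(blocks, x_eval)
    print("blocks used before convergence:", len(ensemble))
```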
