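Progress and trends in parallel data processing, analysis, and mining in the context of big spatiotemporal data (时空大数据背景下并行数据处理分析挖掘的进展及趋势)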
  • English title: Research progress and trends of parallel processing, analysis, and mining of big spatiotemporal data
  • Authors: GUAN Xuefeng (关雪峰); ZENG Yumei (曾宇媚)
  • Affiliation: State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University
  • Keywords: big spatiotemporal data; high-performance computing; parallel spatial analysis; data mining; progress and trends
  • Journal code: DLKJ
  • Journal: Progress in Geography (地理科学进展)
  • Publication date: 2018-10-30 13:43
  • Publisher: Progress in Geography (地理科学进展)
  • Year: 2018
  • Issue: v.37
  • Funding: National Natural Science Foundation of China (41301411)
  • Language: Chinese
  • Pages: DLKJ201810002
  • Page count: 14
  • CN: 10
  • ISSN: 11-3858/P
  • Classification number: 14-27
Abstract
With the rapid development of the Internet, the Internet of Things, and cloud computing, data tagged with geographic location and time are accumulating explosively, marking the arrival of the era of big spatiotemporal data. In addition to the typical "4V" characteristics of big data, big spatiotemporal data carry rich semantic information and dynamic spatiotemporal correlations, and they have become an important resource for geographers analyzing the physical geographic environment and sensing patterns of human activity. In practice, however, traditional data processing and analysis methods can no longer meet the performance requirements of efficient storage and access, real-time processing, and intelligent mining of such data. Integrating big spatiotemporal data with high-performance computing and cloud computing is therefore an inevitable trend. Against this background, this article first traces the origin and development of the big data concept and introduces the characteristics unique to big spatiotemporal data. It then analyzes the performance requirements arising from research and applications of big spatiotemporal data and summarizes the current state of the underlying hardware and software platforms. Next, it reviews the state of parallelization from three perspectives (storage and management, spatiotemporal analysis, and domain-specific mining) and discusses the open problems in each. Finally, it outlines development trends in big spatiotemporal data research.
