Research on Join Operation of Temporal Big Data in Distributed Environment
  • Authors: ZHANG Wei; WANG Zhijie
  • Affiliations: Department of Computer Science and Engineering, Shanghai Jiao Tong University; School of Data and Computer Science, Sun Yat-Sen University
  • Keywords: temporal big data; distributed memory computing; temporal join; two-level index; partition method; Spark framework
  • Chinese journal title: JSJC
  • English journal title: Computer Engineering
  • Publication date: 2018-12-01 13:45
  • Publisher: Computer Engineering (计算机工程)
  • Year: 2019
  • Issue: v.45; No.498
  • Funding: National Natural Science Foundation of China (U1636210, 61729202); Science and Technology Planning Project of Guangdong Province (2015A030401057, 2016B030307002)
  • Language: Chinese
  • Page: JSJC201903004
  • Number of pages: 7
  • CN: 03
  • ISSN: 31-1289/TP
  • Classification number: 26-31+37
Abstract
Join operations over temporal big data are currently handled mostly by distributed systems, but existing distributed systems do not support temporal join queries natively and cannot meet the low-latency, high-throughput requirements of temporal big data processing. To address this, a two-level index in-memory solution based on Spark is proposed. A global index prunes the distributed partitions, and a local temporal index answers queries within each remaining partition, which improves data retrieval efficiency. A partition method designed for temporal data further optimizes global pruning. Experimental results on real and synthetic datasets show that, compared with the baseline scheme, the proposed scheme significantly improves the processing efficiency of temporal join operations.
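The two-level idea in the abstract can be illustrated with a minimal single-process sketch. It is not the authors' Spark implementation: the names `Partition`, `build_partitions`, and `temporal_join` are hypothetical, partitions are plain Python lists rather than Spark partitions, the "global index" is assumed to be just each partition's time bounds, the "local index" is assumed to be a start-sorted array searched with binary search, and the join condition is assumed to be interval overlap on closed intervals.

```python
from bisect import bisect_right
from dataclasses import dataclass, field
from typing import List, Tuple

# (start, end, payload); the interval is closed: [start, end]
Interval = Tuple[int, int, str]


@dataclass
class Partition:
    """One simulated partition with a simple local temporal index (hypothetical)."""
    records: List[Interval]
    starts: List[int] = field(init=False)
    min_start: int = field(init=False)
    max_end: int = field(init=False)

    def __post_init__(self) -> None:
        # Local index: records sorted by start time, plus the partition's time bounds.
        self.records = sorted(self.records)
        self.starts = [r[0] for r in self.records]
        self.min_start = self.starts[0]
        self.max_end = max(r[1] for r in self.records)

    def overlapping(self, q_start: int, q_end: int) -> List[Interval]:
        # Only records with start <= q_end can overlap; binary search finds that prefix.
        hi = bisect_right(self.starts, q_end)
        return [r for r in self.records[:hi] if r[1] >= q_start]


def build_partitions(records: List[Interval], size: int) -> List[Partition]:
    # Range-partition by start time so each partition covers a compact time span.
    ordered = sorted(records)
    return [Partition(ordered[i:i + size]) for i in range(0, len(ordered), size)]


def temporal_join(left: List[Interval], partitions: List[Partition]) -> List[Tuple[str, str]]:
    result = []
    for l_start, l_end, l_payload in left:
        for p in partitions:
            # Global pruning: skip partitions whose time bounds miss the probe interval.
            if p.max_end < l_start or p.min_start > l_end:
                continue
            # Local lookup inside the surviving partition.
            for _, _, r_payload in p.overlapping(l_start, l_end):
                result.append((l_payload, r_payload))
    return result
```

For example, `temporal_join(left, build_partitions(right, 2))` pairs every interval in `left` with each overlapping interval in `right`, while skipping whole partitions whose time bounds cannot overlap the probe interval. Partitioning by start time keeps those bounds tight, which is what makes the global pruning step effective in this sketch.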
