A survey of task management techniques for big data stream computing
  • Original title (Chinese): 面向大数据流式计算的任务管理技术综述
  • Authors: LIANG Yi (梁毅); HOU Ying (侯颖); CHEN Cheng (陈诚); JIN Yi (金翊)
  • Affiliations: College of Computer Science, Beijing University of Technology; Beijing Computing Center
  • Keywords: big data stream computing; task management; abstract function model; resource allocation; data distribution; fault tolerance
  • Journal code: JSJK
  • Journal: Computer Engineering & Science (计算机工程与科学)
  • Publication date: 2017-02-15
  • Publisher: Computer Engineering & Science (计算机工程与科学)
  • Year: 2017
  • Issue: v.39; No.266
  • Funding: National Natural Science Foundation of China (61202075, 91546111); Beijing Natural Science Foundation (4133081)
  • Language: Chinese
  • Page: JSJK201702001
  • Page count: 12
  • CN: 02
  • ISSN: 43-1258/TP
  • Classification number: 5-16
Abstract
Stream computing is an important computing model for big data, and big data stream computing has become a research hotspot. Task management is one of the core functions of big data stream computing and is responsible for resource scheduling and full-lifecycle management of stream computing tasks. Existing surveys of big data stream computing focus mainly on application requirements, architecture, and overall technology, and lack a fine-grained investigation and analysis of task management techniques. This paper first presents an abstract function model of task management for stream computing, then classifies and reviews the key task management techniques based on this model, and finally investigates and analyzes how existing mainstream big data stream computing systems apply, integrate, and optimize these techniques.
