Research on Improved Parallel I/O for Data Distribution Systems

Original title: 面向数据分发系统的改进型并行I/O研究
Authors: XIAO Zhao-di (肖招娣); HUANGFU Han-cong (皇甫汉聪); YU Yong-zhong (余永忠); LV Shun-feng (吕顺锋)
Affiliations: Foshan Power Supply Bureau, Guangdong Power Grid Co., Ltd. (广东电网有限责任公司佛山供电局); Guangdong Zhuo Wei Network Co., Ltd. (广东卓维网络有限公司)
Keywords: data distribution; parallel computing; parallel I/O; Google File System; metadata
Journal code: ZDHJ
Journal: Techniques of Automation and Applications (自动化技术与应用)
Publication date: 2018-10-25
Publisher: Techniques of Automation and Applications
Year: 2018
Issue: v.37; No.280
Language: Chinese
File name: ZDHJ201810009
Page count: 5
Issue number: 10
CN: 23-1474/TP
Pages: 42-46

Abstract

As the number of users and the complexity of business requirements grow, the external service capability of the data warehouse urgently needs to improve. A data distribution system, which manages distribution through a unified interface, inevitably faces communication congestion caused by concurrent multi-user data access. This paper builds a data distribution application with the open-source Kettle tool and applies parallel computing to improve the efficiency of the serial algorithm. For the parallelization, it describes the traditional data distribution-and-collection parallel I/O scheme in detail and constructs a time estimation equation for it. After analyzing the bottlenecks of that scheme, it proposes an improved metadata-based parallel I/O scheme that borrows ideas from the Google File System. Experiments show that, regardless of the number of parallel computing processes (computational units), the metadata-based scheme outperforms the distribution-and-collection scheme, with shorter data import and export times.

