A Parallel Acceleration Strategy for Distributed DBN in Spark (original title: 一种Spark下分布式DBN并行加速策略)

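English title: A Parallel Acceleration Strategy for Distributed DBN in Spark
Authors: 黄震; 钱育蓉; 于炯; 英昌甜; 赵京霞
English authors: HUANG Zhen; QIAN Yu-rong; YU Jiong; YING Chang-tian; ZHAO Jing-xia; School of Software, Xinjiang University; School of Information Science and Engineering, Xinjiang University; Postdoctoral Research Station of Electrical Engineering, Xinjiang University
Keywords: distributed in-memory computing framework; cache replacement; range partition; deep belief network; data skew
English keywords: distributed memory computing framework; cache replacement; range partition; deep belief network; data skew
Chinese journal title: WXYJ
English journal title: Microelectronics & Computer
Affiliations: School of Software, Xinjiang University; School of Information Science and Engineering, Xinjiang University; Postdoctoral Research Station of Electrical Engineering, Xinjiang University
Publication date: 2018-11-05
Publisher: Microelectronics & Computer (微电子学与计算机)
Year: 2018
Issue: v.35; No.414
Funding: National Natural Science Foundation of China (61562086, 61462079); Xinjiang Autonomous Region Graduate Research and Innovation Project (XJGRI2016029); Education Department of Xinjiang Uygur Autonomous Region Project (XJEDU2016S035); Xinjiang University Doctoral Research Start-up Fund (BS150257)
Language: Chinese
Page: WXYJ201811020
Page count: 6
CN: 11
ISSN: 61-1123/TN
Classification number: 106-111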

Abstract

In Spark, the Distributed Deep Belief Network (DDBN) suffers from data skew, a lack of fine-grained data replacement, and no automatic caching of highly reused data. These problems give DDBN high computational complexity and poor timeliness. To improve the timeliness of DDBN, a data-parallel acceleration strategy for DDBN in Spark is proposed. It comprises the Label Set based on Range Partition (LSRP) algorithm and the Cache Replacement based on Weight (CRW) algorithm. LSRP addresses the data skew problem. CRW addresses the reuse of RDDs (Resilient Distributed Datasets) and the shortage of memory space caused by caching too much data. The results show that, compared with the traditional DBN, the training speed of DDBN is increased by about 2.3 times, and that LSRP and CRW greatly improve the distributed parallelism of DDBN.
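
The abstract names the two ideas but gives no algorithmic detail, so the following is only a minimal sketch of the general techniques, not the authors' LSRP or CRW. It assumes that skew is reduced by choosing range boundaries over label values from their record counts, and that cache entries are ranked by a weight combining reuse count, recomputation cost, and size. The names `range_bounds`, `partition_of`, and `WeightedCache`, and the weight formula itself, are hypothetical.

```python
from bisect import bisect_left
from collections import Counter


def range_bounds(labels, num_partitions):
    """Choose upper bounds over sorted label values so that partitions
    receive roughly equal record counts (assumed skew-aware scheme)."""
    counts = Counter(labels)
    target = len(labels) / num_partitions
    bounds, running = [], 0
    for label in sorted(counts):
        running += counts[label]
        # Close a partition when the cumulative count reaches the next
        # multiple of the per-partition target.
        if len(bounds) < num_partitions - 1 and running >= target * (len(bounds) + 1):
            bounds.append(label)
    return bounds


def partition_of(label, bounds):
    """Index of the first bound >= label; larger labels go to the last partition."""
    return bisect_left(bounds, label)


class WeightedCache:
    """Toy cache that evicts the lowest-weight entry when space runs out."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0
        self.entries = {}  # key -> (size, reuse_count, recompute_cost)

    @staticmethod
    def weight(size, reuse_count, recompute_cost):
        # Hypothetical weight: favour data that is reused often and is
        # expensive to recompute; penalise large blocks.
        return reuse_count * recompute_cost / size

    def put(self, key, size, reuse_count, recompute_cost):
        if size > self.capacity:
            return False  # can never fit
        if key in self.entries:
            # Replace an existing entry: release its space first.
            self.used -= self.entries.pop(key)[0]
        while self.used + size > self.capacity:
            victim = min(self.entries, key=lambda k: self.weight(*self.entries[k]))
            self.used -= self.entries.pop(victim)[0]
        self.entries[key] = (size, reuse_count, recompute_cost)
        self.used += size
        return True
```

With one label far more frequent than the rest, `range_bounds` places that label in a partition of its own instead of splitting the label range into equal-width intervals. In the cache, a rarely reused block is evicted before a frequently reused one of the same size. In Spark itself, the corresponding hooks would be a custom partitioner and explicit `persist`/`unpersist` calls on RDDs; how the paper wires these together is not stated in the abstract.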
