Abstract
The Distributed Deep Belief Network (DDBN) on Spark suffers from data skew, lacks fine-grained data replacement, and cannot automatically cache highly reused data. As a result, DDBN has high computational complexity and poor runtime timeliness. To improve the timeliness of DDBN, a data-parallel acceleration strategy for DDBN on Spark is proposed. It comprises the Label Set based on Range Partition (LSRP) algorithm and the Cache Replacement based on Weight (CRW) algorithm. LSRP addresses the data skew problem, while CRW addresses the reuse of RDDs (Resilient Distributed Datasets) and the memory shortage caused by caching too much data. The results show that, compared with the traditional DBN, the training speed of DDBN increases by about 2.3 times, and that LSRP and CRW greatly improve the degree of distributed parallelism of DDBN.