Distributed Convolutional Neural Network Training Based on an Approximate Newton-type Method
  • English title: Distributed Convolutional Neural Networks Based on Approximate Newton-type Method
  • Authors: WANG Ya-hui; LIU Bo; YUAN Xiao-tong
  • English authors: WANG Ya-hui; LIU Bo; YUAN Xiao-tong (Department of Information and Control, Nanjing University of Information Science and Technology; Jiangsu Key Laboratory of Big Data Analysis Technology; Department of Computer Science, Rutgers University)
  • Keywords: Optimization problem; Approximate Newton-type method; Distributed framework; Neural network
  • Journal code: JSJA
  • English journal title: Computer Science
  • Affiliations: Department of Information and Control, Nanjing University of Information Science and Technology; Jiangsu Key Laboratory of Big Data Analysis Technology; Department of Computer Science, Rutgers University
  • Publication date: 2019-07-15
  • Journal: 计算机科学 (Computer Science)
  • Year: 2019
  • Volume/Issue: Vol. 46, No. 07
  • Funding: Supported by the National Natural Science Foundation of China (61876090, 61522308)
  • Language: Chinese
  • Record ID: JSJA201907028
  • Pages: 186-191 (6 pages)
  • CN: 50-1075/TP
Abstract
Most machine learning problems ultimately reduce to optimization problems (model learning). Optimization uses mathematical methods to study how various problems can best be solved, and it plays an increasingly important role in scientific computing and engineering analysis. With the rapid development of deep networks, the scale of data and parameters keeps growing. Although GPU hardware, network architectures, and training methods have advanced significantly in recent years, a single computer still struggles to train deep network models efficiently on large datasets. The distributed approximate Newton-type method, one effective way to address this problem, has therefore been introduced into the study of distributed neural networks. It distributes the training samples evenly across multiple computers, reducing the amount of data each computer must process, and the computers communicate with one another to complete the training task cooperatively. This paper proposes distributed deep learning based on an approximate Newton-type method: the same network is trained with the distributed approximate Newton-type (DANE) algorithm, and as the number of GPUs increases by powers of two, the training time decreases by nearly the same factor. This is consistent with the ultimate goal of this work: under the premise of preserving estimation accuracy, to implement the approximate Newton-type method on an existing distributed framework and use it to train neural networks in a distributed manner, thereby improving training efficiency.
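To make the DANE-style update described above concrete, the following is a minimal, self-contained Python/NumPy sketch of one such round on a toy least-squares problem, with the distributed workers simulated as in-memory data shards. It is an illustration under stated assumptions rather than the paper's implementation: the function names (local_grad, dane_round) and the hyperparameters (eta, mu, local_steps, lr) are hypothetical, and the actual work trains convolutional networks on multiple GPUs.

```python
# Toy simulation of a DANE-style round (illustrative sketch, not the paper's code).
import numpy as np

def local_grad(A, b, w):
    # Gradient of the local objective f_i(w) = ||A w - b||^2 / (2 n_i).
    return A.T @ (A @ w - b) / len(b)

def dane_round(shards, w, eta=1.0, mu=0.1, local_steps=50, lr=0.1):
    # One DANE-style round: compute local gradients, average them (this plays
    # the role of an all-reduce), let each "worker" approximately solve its
    # regularized local subproblem, then average the local solutions.
    grads = [local_grad(A, b, w) for A, b in shards]
    g = np.mean(grads, axis=0)
    new_ws = []
    for (A, b), gi in zip(shards, grads):
        v = w.copy()
        for _ in range(local_steps):
            # Gradient of the DANE local objective:
            #   f_i(v) - (grad f_i(w) - eta * grad f(w))^T v + (mu/2) ||v - w||^2
            step = local_grad(A, b, v) - gi + eta * g + mu * (v - w)
            v -= lr * step
        new_ws.append(v)
    return np.mean(new_ws, axis=0)

rng = np.random.default_rng(0)
w_true = rng.normal(size=20)
shards = []  # 4 simulated workers, each holding an even share of the samples
for _ in range(4):
    A = rng.normal(size=(256, 20))
    shards.append((A, A @ w_true + 0.01 * rng.normal(size=256)))

w = np.zeros(20)
for t in range(10):
    w = dane_round(shards, w)
print("parameter error:", np.linalg.norm(w - w_true))
```

In a real multi-GPU deployment, the gradient averaging and the averaging of the local solutions would each correspond to a collective communication step (e.g., an all-reduce), and because each worker only touches its own data shard, doubling the number of workers roughly halves the per-round computation, which matches the near-power-of-two reduction in training time described in the abstract.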
