Research on a Spark-Based Parallel SVM Parameter Optimization Algorithm

English title: Spark Parallel SVM Parameter Optimization Algorithm
Authors: HE Jingwei (何经纬); LIU Lizhi (刘黎志); PENG Bei (彭贝); FU Xingbao (付星堡)
Affiliations: Hubei Key Laboratory of Intelligent Robot (Wuhan Institute of Technology); School of Computer Science & Engineering, Wuhan Institute of Technology
Keywords: support vector machine; parameter optimization; Spark; parallelism; load balancing
Chinese journal name: WHHG
English journal name: Journal of Wuhan Institute of Technology
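Publication date: 2019-06-15
Publisher: Journal of Wuhan Institute of Technology
Year: 2019
Issue: v.41; No.212
Funding: 10th Graduate Education Innovation Fund of Wuhan Institute of Technology (CX2018215)
Language: Chinese
Pages: WHHG201903015
Page count: 7
CN: 03
ISSN: 42-1779/TQ
Classification number: 85-91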

Abstract

Traditional support vector machine (SVM) parameter optimization algorithms suffer from long search times and excessive memory consumption when applied to large-sample data sets. To address these problems, we propose a parallel, tunable SVM parameter optimization algorithm based on the Spark general-purpose computing engine. The algorithm first uses the Spark cluster to distribute the training set to every Executor as a broadcast variable, and then parallelizes the SVM parameter search. During the search it controls the Task parallelism so that the load is balanced across Executors, which speeds up the search. Experimental results show that, with a suitably chosen Task parallelism, the proposed algorithm makes full use of cluster resources while finding the optimal parameters faster and reducing the optimization time.
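The idea in the abstract (share one read-only copy of the training set, split the candidate parameters into a chosen number of tasks, and keep those tasks evenly sized) can be illustrated with a minimal standard-library Python sketch. This is not the authors' Spark implementation. It assumes a grid search over the penalty `C` and an RBF kernel `gamma`, which the abstract does not specify; it uses a thread pool as a stand-in for Executors; and `toy_score` is a made-up placeholder for cross-validated SVM accuracy. All function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product
from math import log2


def build_grid(c_exps, gamma_exps):
    # Assumed search space: (C, gamma) pairs on a log2 grid.
    return [(2.0 ** c, 2.0 ** g) for c, g in product(c_exps, gamma_exps)]


def split_balanced(candidates, parallelism):
    # Round-robin split: task sizes differ by at most one candidate,
    # so no worker is left with a much longer task than the others.
    return [candidates[i::parallelism] for i in range(parallelism)]


def toy_score(params, training_set):
    # Placeholder for cross-validated SVM accuracy on training_set.
    # A made-up surface that peaks at C = 2**3, gamma = 2**-5.
    c, gamma = params
    return -((log2(c) - 3) ** 2 + (log2(gamma) + 5) ** 2)


def best_in_task(task, training_set):
    # Each task scores its own candidates and reports its local best.
    return max((toy_score(p, training_set), p) for p in task)


def parallel_search(candidates, training_set, parallelism, workers):
    # Drop empty tasks in case parallelism exceeds the number of candidates.
    tasks = [t for t in split_balanced(candidates, parallelism) if t]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # training_set is shared read-only by all tasks, playing the
        # role the broadcast variable plays in the abstract.
        local_bests = list(pool.map(lambda t: best_in_task(t, training_set), tasks))
    # Reduce the per-task winners to the global best.
    return max(local_bests)


grid = build_grid(range(-5, 16, 2), range(-15, 4, 2))
training_set = []  # stand-in; a real run would hold the samples here
score, (best_c, best_gamma) = parallel_search(
    grid, training_set, parallelism=8, workers=4
)
print(best_c, best_gamma)
```

Here `parallelism` plays the role of the Task parallelism and `workers` the number of Executors. Choosing `parallelism` as a small multiple of `workers` keeps every worker busy without creating many tiny tasks, which is the trade-off the abstract attributes to a reasonable parallelism setting.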

