Convolution Calculation Optimization Method Based on Matrix Transformation

Original title: 基于矩阵转换的卷积计算优化方法
Authors: FANG Yuling (方玉玲); CHEN Qingkui (陈庆奎)
Keywords: deep learning; convolution calculation; direct convolution; matrix blocking; Compute Unified Device Architecture (CUDA); convolution optimization
Journal: Computer Engineering (journal code: JSJC)
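Affiliations: Business School, University of Shanghai for Science and Technology; School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology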
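Publication date: 2018-11-02 11:02
Publisher: Computer Engineering (计算机工程)
Year: 2019
Volume: v.45; No.502
Funding: National Natural Science Foundation of China (61572325, 60970012); Specialized Research Fund for the Doctoral Program of Higher Education, doctoral supervisor category (20113120110008); Shanghai Key Science and Technology Project (14511107902, 16DZ1203603); Shanghai Engineering Center Construction Project (GCZX14014); Shanghai Engineering Center for Large-Scale IoT Common Technology in Smart Homes (GCZX14014); Shanghai First-Class Discipline Construction Project (XTKX2012); Hujiang Foundation Research Base Special Project (C14001)
Language: Chinese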
Article ID: JSJC201907035
Page count: 6
Issue: 07
CN: 31-1289/TP
Pages: 223-227+234

Abstract

This paper proposes MCFA, an efficient convolution calculation optimization method based on matrix transformation. The input matrix is partitioned into blocks according to the width of the output matrix and the size of the convolution kernel. The input sub-blocks and the kernel matrix are then transformed with the im2col method, and the matrix-matrix multiplication library packaged with the Compute Unified Device Architecture (CUDA) is used to accelerate the convolution. Finally, the output sub-blocks are arranged in order to form the complete output matrix. Experimental results show that the method reduces the memory space used for computation by 61.25% compared with im2col and improves computation speed by 20.57% compared with MEC. With blocking, it also relieves the cache pressure caused by large input matrices and improves cache utilization.
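
The pipeline described in the abstract (partition the input, apply im2col to each sub-block, multiply by the flattened kernel, then stack the output sub-blocks) can be illustrated with a minimal sketch. This is not the authors' MCFA implementation: it assumes a single channel, stride 1, no padding, and a partition by groups of output rows, since the abstract does not give the exact blocking rule. The function names and the `rows_per_block` parameter are hypothetical, and a plain Python product stands in for the GPU matrix multiplication library.

```python
def im2col(block, kh, kw):
    """Unfold each kh x kw window of `block` (a list of rows) into one row."""
    out_h = len(block) - kh + 1
    out_w = len(block[0]) - kw + 1
    return [
        [block[i + di][j + dj] for di in range(kh) for dj in range(kw)]
        for i in range(out_h)
        for j in range(out_w)
    ]


def multiply(matrix, vector):
    """Matrix-vector product; a stand-in for the GPU GEMM call.

    With several kernels the vector would become a matrix with one
    column per kernel, and this step would be a full matrix-matrix product.
    """
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def conv2d_direct(x, k):
    """Reference direct convolution (cross-correlation, as used in CNNs)."""
    kh, kw = len(k), len(k[0])
    out_h, out_w = len(x) - kh + 1, len(x[0]) - kw + 1
    return [
        [
            sum(x[i + di][j + dj] * k[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def conv2d_blocked(x, k, rows_per_block):
    """Blocked im2col convolution (assumed partition: groups of output rows)."""
    kh, kw = len(k), len(k[0])
    out_h, out_w = len(x) - kh + 1, len(x[0]) - kw + 1
    kernel_col = [v for row in k for v in row]  # flattened kernel
    out = []
    for r0 in range(0, out_h, rows_per_block):
        r1 = min(r0 + rows_per_block, out_h)
        # Output rows r0..r1-1 need input rows r0..r1+kh-2, so adjacent
        # blocks overlap by kh - 1 input rows.
        block = x[r0:r1 + kh - 1]
        # Only this block's unfolded matrix exists at any one time.
        flat = multiply(im2col(block, kh, kw), kernel_col)
        # Reshape the flat result and append it in order.
        for r in range(r1 - r0):
            out.append(flat[r * out_w:(r + 1) * out_w])
    return out


if __name__ == "__main__":
    x = [[5 * i + j for j in range(5)] for i in range(5)]
    k = [[1, 0, -1], [1, 0, -1], [1, 0, -1]]
    print(conv2d_blocked(x, k, rows_per_block=2) == conv2d_direct(x, k))
```

In this sketch the temporary unfolded matrix holds at most `rows_per_block * out_w` rows of `kh * kw` elements, instead of `out_h * out_w` rows for whole-input im2col. That is the general reason blocking lowers the working-set size; the specific savings reported in the abstract come from the authors' own experiments and are not reproduced here.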

