Abstract
This paper proposes MCFA, an efficient convolution optimization method based on matrix transformation. The input matrix is partitioned into blocks according to the width of the output matrix and the size of the convolution kernel. Each input sub-block and the kernel matrix are transformed with the im2col method, and the matrix-matrix multiplication library packaged with the Compute Unified Device Architecture (CUDA) is used to accelerate the convolution. The output sub-blocks are then arranged in order to form the complete output matrix. Experimental results show that, compared with im2col, the method reduces the memory required for computation by 61.25%, and that, compared with MEC, it increases computation speed by 20.57%. With blocking, it also relieves the cache pressure caused by large input matrices and improves cache utilization.
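The pipeline described in the abstract (partition the input, apply im2col to each sub-block, multiply by the flattened kernel, then reassemble the output sub-blocks in order) can be illustrated with a minimal pure-Python sketch. Several things here are assumptions rather than the paper's actual design: the input is split into horizontal strips controlled by a hypothetical `rows_per_block` parameter, stride is 1 with no padding, a single 2-D kernel is used, and the naive `matmul` helper merely stands in for a GEMM call from the CUDA matrix multiplication library. The names `im2col`, `matmul`, and `conv2d_blocked` are illustrative only.

```python
def im2col(x, kh, kw):
    """Unroll every kh x kw window of x (stride 1, no padding) into one row."""
    oh, ow = len(x) - kh + 1, len(x[0]) - kw + 1
    return [
        [x[i + di][j + dj] for di in range(kh) for dj in range(kw)]
        for i in range(oh)
        for j in range(ow)
    ]


def matmul(a, b):
    """Naive matrix product; a stand-in for a GEMM library call."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]


def conv2d_blocked(x, k, rows_per_block):
    """Cross-correlate x with k, one band of output rows at a time.

    rows_per_block is an assumed tuning knob: it bounds the im2col buffer
    to rows_per_block * ow rows instead of oh * ow rows.
    """
    kh, kw = len(k), len(k[0])
    oh, ow = len(x) - kh + 1, len(x[0]) - kw + 1
    k_col = [[v] for row in k for v in row]  # kernel flattened to (kh*kw) x 1
    out = []
    for r0 in range(0, oh, rows_per_block):
        r1 = min(r0 + rows_per_block, oh)
        strip = x[r0:r1 + kh - 1]  # input rows needed for output rows r0..r1-1
        flat = matmul(im2col(strip, kh, kw), k_col)  # ((r1 - r0) * ow) x 1
        # Reshape the flat result into output rows and append them in order.
        for r in range(r1 - r0):
            out.append([flat[r * ow + c][0] for c in range(ow)])
    return out


if __name__ == "__main__":
    x = [[4 * i + j + 1 for j in range(4)] for i in range(4)]  # 4 x 4 input, values 1..16
    k = [[1, 0], [0, -1]]  # 2 x 2 kernel
    print(conv2d_blocked(x, k, rows_per_block=2))
```

In this sketch the result does not depend on `rows_per_block`; only the size of the temporary im2col buffer changes, which is the kind of memory saving that blocking is meant to provide.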