Research on Profitability Evaluation Techniques in Automatic Vectorization
Abstract
With the recent rapid development of the multimedia industry, centered on audio/video and image content, general-purpose microprocessor architectures have come to widely support multimedia extension technology. These extensions exploit the SIMD parallelism that is abundant in multimedia programs to accelerate them, chiefly through SIMD instruction-set support for small data types and for the complex operations such programs commonly perform. In recent years, SIMD mechanisms have also been introduced on a large scale in high-performance computing and in embedded DSP processors to improve performance. The development of these computing technologies and applications has in turn driven research on automatic vectorization in compilers. In practice, however, vectorized code is not always faster than the original scalar code: vectorization introduces overhead of its own, and is worth performing only when its benefit exceeds its cost. Under what circumstances is vectorization profitable? This thesis centers on that question, analyzing and exploring benefit-evaluation models for automatic vectorization, together with their design methods and implementation techniques, so as to support compiler optimization accurately and quantitatively.
     First, to address the performance losses that arise in SIMD auto-vectorization, the thesis defines SIMD profitability evaluation, presents the WHIRL intermediate representation on which the evaluation operates, establishes a profit-cost evaluation model, and selects the best pack-generation scheme.
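The core decision described above, vectorize a pack only when its benefit exceeds its overhead, can be sketched as follows. The cost tables and the `Pack` structure here are illustrative assumptions, not the thesis's actual WHIRL-level model:

```python
from dataclasses import dataclass

# Hypothetical per-operation costs in cycles; a real model would derive
# these from the target ISA rather than fixed constants.
SCALAR_COST = {"add": 1, "mul": 3, "load": 2, "store": 2}
VECTOR_COST = {"add": 1, "mul": 3, "load": 2, "store": 2}

@dataclass
class Pack:
    """A group of isomorphic scalar statements packed into one SIMD op."""
    op: str
    width: int          # number of scalar lanes packed together
    overhead: int = 0   # e.g. packing/unpacking or realignment cost

def profitable(packs):
    """Vectorize only if total vector cost (plus overhead) beats scalar cost."""
    scalar = sum(SCALAR_COST[p.op] * p.width for p in packs)
    vector = sum(VECTOR_COST[p.op] + p.overhead for p in packs)
    return vector < scalar

# Four independent adds pack cleanly: 4 scalar cycles vs 1 vector cycle.
assert profitable([Pack("add", width=4)])
# Heavy packing overhead can erase the benefit entirely.
assert not profitable([Pack("add", width=4, overhead=10)])
```

The comparison mirrors the thesis's premise: the model's job is not to vectorize as much as possible, but to reject packs whose reorganization overhead outweighs the SIMD speedup.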
     Second, building on traditional vectorization, the thesis proposes a basic profitability-evaluation model for SLP vectorization, together with a cost-evaluation method for local data reorganization under contiguous but misaligned memory access, and uses them to optimize SIMD vectorization performance. It identifies the major factors affecting vectorization performance: contiguous memory access, aligned memory access, splice instructions, and register pressure.
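As a sketch of how a contiguous-but-misaligned access stream might be charged in such a model (the vector factor, alignment requirement, and penalty values below are assumptions for illustration, not the thesis's calibrated numbers):

```python
VF = 4       # vector factor: scalar elements per SIMD register
ALIGN = 16   # assumed alignment requirement in bytes
ELEM = 4     # element size in bytes

def load_cost(base_offset, n_elems, aligned_cost=1, realign_cost=2):
    """Cost of loading n_elems contiguous elements starting at base_offset.

    Aligned vector loads cost `aligned_cost` each; a misaligned stream is
    modeled as needing extra realignment (shift/merge) work per vector load.
    """
    n_vloads = -(-n_elems // VF)  # ceiling division
    if (base_offset * ELEM) % ALIGN == 0:
        return n_vloads * aligned_cost
    return n_vloads * (aligned_cost + realign_cost)

assert load_cost(0, 8) == 2   # two aligned vector loads
assert load_cost(1, 8) == 6   # misaligned: realignment on every load
```

Even this toy version shows why alignment appears among the dominant factors: the realignment term scales with the number of vector loads, so a misaligned stream can triple the memory-access cost of a loop.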
     Third, loop transformations can affect the performance of vectorized programs. The thesis discusses the polyhedral representations of different loop-transformation types and studies a vectorization performance-metric model under the polyhedral representation. Based on the existing SIMD compiler-optimization framework, it proposes a new optimization flow to guide loop transformation, presents an optimal-vectorization-alternatives-search (OVAS) algorithm, details the profit-cost model under non-unit-stride memory access, proposes a cost-evaluation method for misaligned access on top of the memory-access model, and establishes a SIMD vectorization cost model based on the polyhedral representation.
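The search an OVAS-style algorithm performs can be caricatured as picking the minimum-cost point from a space of loop-transformation alternatives. The candidate encoding and cost function below are invented for illustration; the actual algorithm works over polyhedral schedules rather than tuples:

```python
def search_best_alternative(candidates, cost):
    """Evaluate each transformation alternative and keep the cheapest one,
    a brute-force stand-in for a guided OVAS-style search."""
    best = min(candidates, key=cost)
    return best, cost(best)

# Each alternative: (interchange?, unroll factor). The toy cost favors
# unit-stride access (no interchange) and moderate unrolling.
candidates = [(False, 1), (False, 4), (True, 1), (True, 4)]

def cost(alt):
    interchange, unroll = alt
    stride_penalty = 8 if interchange else 0   # non-unit-stride access
    return stride_penalty + 16 // unroll       # unrolling amortizes loop overhead

best, c = search_best_alternative(candidates, cost)
assert best == (False, 4) and c == 4
```

The point of coupling the search to a cost model, as the thesis does, is that the same transformation (here, interchange) can help or hurt depending on the resulting access strides, so profitability must be evaluated per alternative rather than assumed.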
     Finally, the techniques above are implemented, tested thoroughly on the target machine, and analyzed, and directions for future work are discussed.
