SIMD编译优化方法研究

作者：朱嘉华
论文级别：博士
学科专业名称：计算机系统结构
中文关键词：多媒体应用 ; 单指令多数据指令集 ; SIMD编译优化
英文关键词：Multimedia Application ; Single Instruction Multi Data Instruction Set ; SIMD Optimization
学位年度：2005
导师：朱传琪 ; 臧斌宇
学科代码：081201
学位授予单位：复旦大学
论文提交日期：2005-11-06

摘要

随着多媒体应用在计算领域中的重要性日趋显著，为了在多媒体应用中获得更为出色的性能表现，几乎所有的通用处理器生产厂商都为他们的处理器集成了一个或多个多媒体扩展部件。由于多媒体应用的核心代码往往具有高度的可并行性，并且对计算精度的要求远低于通用处理器所提供的计算能力，因此这些多媒体扩展部件往往以通用处理器已有的计算资源为基础，通过一系列整合后，以向量部件的形式出现。而相应的扩展指令集则以单指令多数据(Single
     Instruction Multi Data)的向量指令为主，我们将这些指令称作“SIMD多媒体扩展指令”，简称“SIMD指令”，同时，我们称这些多媒体扩展部件为SIMD部件。
     目前，程序员只能通过编译器有限的支持来使用各种SIMD指令，例如通过嵌入汇编代码，或者使用编译器提供的内部函数(intrinsic function)等手段在代码中显式地使用SIMD部件的指定计算功能。这些手段的使用要求程序员深入了解SIMD部件的体系结构，并且具备基本的向量化程序开发能力，大大提高了多媒体程序开发的难度。一方面，由于各种处理器中集成的SIMD部件及相关的SIMD指令各不相同，针对不同SIMD部件开发的代码也不完全相同，从而导致了代码的可重用程度较低。为了降低开发的难度，提高代码的可重用程度，我们必须加强编译器对SIMD扩展指令集的支持，使之能够利用SIMD指令自动优化高级语言编写的多媒体代码，编译器的这一功能被称作SIMD编译优化。
     虽然，SIMD编译优化技术隶属于向量化技术，然而，由于通用处理器中集成的多媒体部件与传统的向量处理器之间存在着显著的差异，致使传统的向量化技术无法被简单地移植于SIMD编译优化中。迄今为止，只有少数商业编译器(如ICC8.0)能够对个别多媒体程序实施有效地SIMD优化。而在学术界，针对SIMD编译的研究往往过于理想化，而无法提高真实多媒体应用的性能。
     本文阐述了将传统的向量化方法运用于SIMD编译优化中的可能性与潜在的障碍。并以此为基础，展开了SIMD编译优化技术的研究，提出了多项能够显著提高代码性能的编译优化技术。并以开放源代码(Open Source)编译器Gcc3.5，以及并行编译平台Aggassiz为基础实现了SIMD优化编译的原型系统。其中，主要的工作与创新包括：
     1．详细研究了当前常见的多媒体应用的核心代码以及流行的SIMD部件的体系结构，指出了SIMD多媒体编译优化技术与传统向量化方法的共同点
    与重要区别。
     2．针对向量化技术中常用的位宽分析无法适应SIMD编译的需求，提出了基于双向数据流分析的有效位分析方法，为SIMD编译优化控制数据宽度、提高并行度提供了必要的分析基础。
     3．由于SIMD微处理器体系结构的多媒体扩展与高级语言中的整数扩展规则有不协调之处，严重影响了程序的执行效率。为此，我们提出溢出控制方法，分别在饱和计算模式和封闭计算模式上解决了这一问题。
     4．我们针对SIMD扩展，解决了一阶线性递归饱和计算的向量化问题，避免了在向量部件与串行部件之间交换数组，提高了计算效率。此外我们还提出了周期常量的识别以及变化周期的求解方法，对存在周期常量的多媒体应用程序扩大了可向量化范围。
     5．以Gcc3.5与Aggassiz编译系统为基础实现了两个编译原型系统，并利用这两个原型系统对本文提出的分析与优化技术进行了正确性验证与优化效果分析。
     综合来说，本文以提高SIMD多媒体编译优化能力为出发点，对其中若干关键技术进行深入研究，着重讨论了如何提高分析精度以适应SIMD编译需求，以及如何在饱和模式下开发潜在并行性的方法等问题。并通过实验验证了研究成果的正确性与有效性。
Since multimedia has become a dominating computing field, to meet such a trend, almost all general purposed processor venders have integrated multimedia extensions (MME) to their processors. Due to the potential parallelism and the low caculative precision requirement of multimedia applications most MME are implemented with Single Instruction Multi Data instruction sets.
    Currently, programmers are mainly restricted to using assember to utilize these MMEs with in-lining assembly codes or intrinsic functions. With these methods, the development become extremely inefficient and the code would be hard to be transplanted between different platforms. An alternative way is to make compiler automatically generate SIMD instructions from the code of standard high level programming languages.
    Although SIMD optimization is a part of vectorization, the traditional vectorization technique could not be simply transplanted to SIMD optimization due to the differences between vector processor and SIMD architecture. Currently, there is only few compilers could speedup some individual multimedia applications.
    Based on the deep study to the SIMD architecture and widely analysis to the multimedia workload. We carried on a series research to develop efficient SIMD optimization techniques, and to find out some useful techniques in this area. Meanwhile, we implemented these techniques with open source compiler Gcc3.5 and parallelization research platform Aggassiz as well. The work of this dissertation includes:
    1. This dissertation thoroughly investigates the present research on SIMD architecture, multimedia workload, and SIMD compilation techniques. One conclusion is that there are obvious gaps between SIMD compilation technique and traditional vectorization techniques
    2. To meet the requirement of SIMD compiler, we introduced "valid bits analysis", a dual direct data flow analysis, as the basic analysis for further optimizations, to control the band-width and enhance the parallelism.
    3. Since there are strong conflicts between the pack arithmetic of SIMD architecture and Integer Promotion rule of programming language standard, we introduced the overflow controlled SIMD arithmetic to optimize the related code.
    4. We successfully vectorize the linear recursive arithmetic in saturated mode to avoid exchanging the data between serial components and SIMD components. So that we can enhance the performance of related code. Meanwhile, we proposed the method how to discover Cyc-Constant, with this method more potential parallelism are developed.
    5. We also implemented two prototype system with open source compiler Gcc3.5 and parallelization research platform Aggassiz, and verified the correctness for the SIMD optimization techniques and measure the performance improvement as well.
    Conclusively, to enhance the capacity of SIMD optimization we discussed several critical techniques, such as how to performe highly accurately data bit width analysis and how to develop potential parallelism in saturation arithmetic mode. On the other hand, we verified the correctness and analyzed the performance improvement for the techniques.

引文

[1] Allen R, Kennedy K. Vector register allocation[J], IEEE Trans.Comput. 1992, 41(10): 1290-1317.

    [2] Allen R, Kennedy K, Porterfield C, Warren J, Conversion of control dependence to data dependence[C]. Proceedings of 10~(th) ACM symposium on Principles of programming languages. ACM Press, 1983, 177-189.

    [3] Allen R, Kennedy K. Automatic translation of Fortran programs to vector form[J]. ACM Transactions on Programming Languages and Systems, 1987,9(4): 491-542.

    [4] Aho A V, Sethi R, Ullman J D. Compilers: Principles, Techniques and Tools [M]. Addison-Wesley, 1986.

    [5] Al-Shayka O, Chen S, Mattvelli M. Introduction to the special issue on multimedia implementation [J]. IEEE Transaction on Circuits and Systems for Video Technology, 2002, 12(8): 629-632.

    [6] Appel A W. Modern compiler implementation in Java [M], Cambridge University Press, 1998.

    [7] Bacon D F, Graham S L, Sharp O J. Compiler transformations for high-performance computing[J]. Comput. Surv., 1994,26(4): 345-420.

    [8] Banerjee U K. Dependence Analysis for Supercomputing[M]. Kluwer Academic Publishers, Norwell, MA, USA, 1988.

    [9] Bik A J C. The Software Vectorization Handbook[M]. Intel Press, 2004.

    [10] Bik A J C, Girkar M, Grey P M, Tian X. Efficient exploitation of parallelism on Pentium III and Pentium 4 processor-based systems[C]. Intel Technology Journal. 2001.
    [11] Bik A J C, Girkar M, Grey P M, Tian X. Experiments with automatic vectorization for Pentium 4( R) processor[C]. Compilers for Parallel Computers. 2001.

    [12] Bik A J C, Girkar M, Grey P M, Tian X. Automatic intra-register vectorization for the 11th International Workshop on Languages and Compiler for Parallel Program., 2002, 30(2): 65-98.

    [13] Bik A J C, Girkar M, Grey P M, Tian X. Automatic detection of saturation and clipping idioms[C]. Proceedings of the 15th International Workshop on Languages and Compilers for Parallel Computing. 2002.

    [14] Boekhold M, Karkowski I, Corporaal H. Transforming and parallelizing ANSI C programs using pattern recognition[C]. Lecture Notes in Computer Science. 1999.

    [15] Bryant R E, O'Hallaron D R. Computer Systems: A Programmer's Perspective [M]. Prentice Hall, 2003.

    [16] Bulic P, Gustin V. Macro extension for SIMD Processing [C]. Proceedings of the 7 th International Euro-Par Conference Manchester on Parallel Processing. Springer-Verlag, 2001, 448-451.

    [17] Bulic P, Gustin V. An extended ANSI C for processors with a multimedia extension[J]. Int. J. Parallel Program., 2003, 31(2): 107-136.

    [18] Bulic P, Gustin V. Data dependence analysis for tntra-register vecorization[C]. Second International Symposium on Parallel and Distributed Computing. 2003, 50-56.

    [19] Carlson D A, Castelino R W, Mueller R O. Multimedia extension for 550mhz RISC microprocessor[J]. IEEE Journal of Solid-State Circuits, 1997, 32(11): 1618-1624.

    [20] Cheong G, Lam M S. An optimizer for multimedia instruction sets[C]. Proceedings of 2nd SUIF Compiler Workshop, Stanford University. 1997.

    [21] Conte T M, Dubey P K, Jennings MD, Lee R B, Peleg A, Rathnam, Schlansker M, Song P, Wolf A. Chanllenges to combining general-purpose and multimedia processors[J]. Computer, 1997, 30(12): 33-37.

    [22] Corbal J, Valero M, Espasa R. Exploiting a new level of DLP in Multimedia applications[C]. Proceedings of 32nd annual ACM/IEEE international symposium on Microarchitecture. IEEE Computer Society, 1999, 72-79.

    [23] Crescent Bay Software Corp. http://www.psrv.com/vast altivec.html, 2003.

    [24] Diefendorff K, Dubey P K, Hochsprung R, Scales H. AltiVec extension to PowerPC accelerates media processing[J]. IEEE Micro, 2000,20(2): 85-95.

    [25] Eichenberger A E, Wu P, O'Brien K.Vectorization for SIMD architectures with alignment constraints[C]. Proceedings of the ACM SIGPLAN 2004 conference on Programming language design and implementation. ACM Press, 2004, 82-93.

    [26] Eijndhoven J T J V, Pol E J D. TriMedia CPU 64 architecture [C]. ICCD'99: Proceedings of 1999 IEEE International Conference on Computer Design. IEEE' Computer Society, 1999, 586.

    [27] Ferretti M. Multimedia extensions in super-pipelined micro-architectures. Anew case for SIMD processing[C]. Proceedings of the 5~(th) IEEE International Workshop on Computer Architecture for Machine Perception(CAMP'00). IEEE Computer Sciety, 2000,249.

    [28] Ferretti M, Rizzo D. Multimedia extensions and sub-word parallelism in image processing: Preliminary results[C]. Proceedings of the 5th International Euro-Par Conference on Parallel Processing. Spinger-Verlag, 1999,977-986.

    [29] Fisher R J, Dietz H G. Compiling for SIMD within a register[C]. Proceedings of the 11th International Workshop on Languages and Compilers for Parallel Computing. Springer-Verlag, 1999,290-304.

    [30] Franchetti F, Puschel M. A SIMD vectorizing compiler for digital signal processing algorithms[C]. Proceedings of the 16th International Symposium on Parallel and Distributed Processing. IEEE Computer Sciety, 2002, 202.

    [31] Fraser C W, Hanson D R, Proebsting T A. Engineering a simple, efficient code-generator[J]. ACM Lett. Program. Lang. Syst., 1992, 1(3): 213-226.

    [32] Frittis J, Mangione-Smith W H. Mediabench II - technology, staus, and coorperation[C]. 4th Workshop on Media and Stream Processors. 2002.

    [33] Gaffar A A, Mencer O, Luk W, Cheung P Y, Shirazi N. Floating-point bitwidth analysis via automatic differentiation[C]. IEEE International Conference on Field-Programmable Technology. 2002, 1135-1200.

    [34] Gustin V, Bulic P. Extracting SIMD parallelism from 'for' loops[C]. Proceedings of the 2001 In ternational Conference on Parallel Processing Workshops.IEEE Computer Society, 2001, 23.

    [35] Hennessy J L, Patterson D A. Computer Architecure: A Quantitative Aproach. 3rd edition[M]. Morgan Kaufmann, 2003.

    [36] Intel Corporation. IA32 Intel Architecture Software Developer's Manual, Volume 1: Basic Architecture[M]. Intel Press, 2004.

    [37] Intel Corporation. IA32 Intel Architecture Software Developer's Manual, Volume 2: Instruction Set Reference[M]. Intel Press, 2004.

    [38] Intel Corporation. IA32 Intel Architecture Software Developer's Manual, Volume 3: System Programming Guide[M]. Intel Press, 2004.

    [39] Intel Corporation. Intel c/c++ compiler user's guide[EB/OL].
    [40] http://www.developer.intel.com, 2003.

    [41] Intel Corporation. Intel math kernel library[EB/OL].

    [42] http://www.intel.com/software/products/mkl/, 2004
    [43] Intel Corporation. Intel performance primitive[EB/OL].

    [44] http://www.intel.com/software/product/perflib/, 2004.

    [45] Intel Corporation. IA32 Intel Architecture Optimization Reference Manual[M]. Intel Press, 2002.

    [46] Ireneusz R M. Evaluation of a potential for automatic SIMD parallelization of embedded applications[EB/OL]. http://citeseer.ni.nec.com/232583.html.

    [47] Juurlink B H H, Vassiliadis S, Tcheressiz D, Wijshoff H A G. Implementation and evaluation of the complex streamed instruction set[C]. Proceedings of the 2001 International Conference on Parallel Architectures and Coplilation Techniques. IEEE Computer Society, 2001, 73-82.

    [48] Keith Cooper A D, Kennedy K. Vizer: A system to vectorize intel x86 binaries[C]. Proceedings of the Third Annukal Symposium of the Los Alamos Computer Science Institute. 2002.

    [49] Kennedy K, Allen J R. Optimizing compilers for modern achitectures: a dependence-based approach[M]. Morgan Kaufmann Publishers Inc., 2002.

    [50] Kohn L, Maturana G, Tremblay M, Prabhu A, Zyner G The Visual Instruction Set (VIS) in UltraSPARC[J]. Proceedings of Compcon'95. 1995.

    [51] Krall A, Lelait S. Compilation techniques for multimedia processors[J]. International Journal of Parallel Programming, 2000, 28(4): 347-361.

    [52] Kuch D L. Structure of Computers and Computation[M]. John Wiley&Sons, Inc., 1978.

    [53] Lappalainen V, Hamalainen T D, Liuha P. Overview of research efforts on media ISA extensions and their usage in video coding[J]. IEEE Trans. on Circuits and Systems for Video Technology, 2002, 12(8): 660-670.
    [54] Lappalainen V, Liuha P, Hamalainen T D. Current research efforts in media ISA development[C]. 4th Workshop on Media and Stream Processors.2002.

    [55] Larsen S, Amarasinghe S. Expoiting superword level parallelism with multimedia instruction sets[J]. ACM SIGPLAN Notices, 2000, 249.

    [56] Larsen S, Amarasinghe S. Increasing and detecting memory address congruence[C]. 11th International Conference on Parallel Achitectures and Compilation Techniques(PACT). 2002.

    [57] Lee C, Potkonjak M, Mangione-Smith W H. Mediabench: a tool for evaluating and sythesizing multimedia and communications systems[C]. Proceedings of the 30th annual ACM/IEEE International Symposium on Microarchitecture. IEEE Computer Society, 1997, 330-335.

    [58] Lee R B. parallelism with MAX-2[J]. IEEE Micro, 1996, 16(4): 51-59.

    [59] Lee R B. Accelerating multimedia with enchanced microprocessors[J]. IEEE Micro, 1995, 15(2): 22-32.

    [60] Lee R B, Smith M D. Media processing: A new design target[J]. IEEE Micro, 1996, 16(4): 6-9.

    [61] Leupers R. Code selection for media processors with SIMD instructions[C]. Proceedings of the Confernce on Design, Automation and Test in Europe. ACM Press, 2000,4-8.

    [62] Liao S, Devadas S, Keutzer K. A text-compression-based method for code size minimization in embedded systems[J]. ACM Trans. Des. Autom. Electron. Syst., 1999, 4(1): 12-38.

    [63] Lorenz M, Marwedel P, Drager T, Fettweis G, Leupers R. Compiler based exploration of DSP energy saving by SIMD operations[C]. ASP-DAC'04: Proceedings of 2004 Conference of Asia South Pacific on Design Automation. IEEE Press, 2004, 838-841.

    [64] Manniesing R, Karkowski I, Corporaal H. Automatic SIMD parallelizing ANSI C programs using pattern recognition[C]. Proceedings of 6th International Euro-Par Conference. 2000, 349-356.
    [65] MIPS Technologies. MIPS extension for digital media with 3D[EB/OL]. http://www.mips.com, 1997.

    [66] MIPS Technologies. MIPS-3 ASE[EB/OL]. http://www.mips.com, 1999.

    [67] Muchnick S S. Advanced compiler implementation in Java[M]. Cambridge University Press, 1988.

    [68] Naishlos D, Biberstein M, Ben-David, Zaks A. Vecorizing for a SIMD DSP architecure[C]. Proceedings of 2003 International Conference on Compilers, Architectures and Synthesis for Enbeded Systems. ACM Press, 2003,2-11.

    [69] ObermanS, Favor G, Weber F. AMD 3DNow! Technology: Architecure and implementations[J]. IEEE Micro, 1999,19(2): 37-48.

    [70] Padua D, Wolfe M. Advanced compiler optimizations for supercomputers[J]. Communications of the ACM, 1986,29(12): 1184-1201.

    [71] Pajuelo A, Gonzalez A, Valero M. Speculative dynamic vectorization[C]. ISCA. 2002, 271-280.
    [72] PelegA, Weiser U. MMX technology extension to intel architecture[J]. IEEE, 1996, 16(4): 42-50.
    [73] Pokam G, Simonnet J, Bodin F. A retargetable processor for multimedia instrucions[C]. Compilers and Parallel Computers. 2001.
    [74] Potteeger B, Eigenmann R. Idiom recognition in the Polaris parallelizing compiler[C]. ICS'95: Proceedings of the 9th International Conference on Supercomputing. ACM Press, New York, NY, USA, 1995, 444-448.
    [75] Portland Group Compiler Technology. Portland group compiler[EB/OL]. http://www.pgroup.com/products/workpgi,htm, 2003.
    [76] Pryanishnikov I, Krall A, Horspool N. Pointer alignment analysis for processors with SIMD instructions[C]. 5th Workshop on Media and Stream Processors. 2003.
    [77] Pugh W. The omega test: A fast practical integer programming algorithm for dependence analysis[C]. Proceedings of the 1991 ACM/IEEE Conference on Supercomputing. ACM Press, New York, NY, USA, 1991, 4-13.
    [78] Ren G, Wu P, Padua D. A preliminary study on the vectorization of multimedia applications[C]. 16th International Workshop of Languages and Compilers for Parallel Computing. 2003.
    [79] Scott K, Davidson J. Exploring the limits of sub-word level parallelism[C]. Proceedings of the 2000 International Conference on Parallel Architectures and Compilation Techniques. IEEE Computer Society, 2001, 81.
    [80] Skadron K, Huphrey M, Huang B, Hilton E, Luo J, Allaire P. The use of mini-vector instructions for implementing high speed feedback controllers on general-purpose computers[C]. Proceedings of the 3rd Workshop on Media and Stream Processors. 2001.
    [81] Slingerland N T. Architecures for Multimedia[D]. Master's Thesis, UC Berkeley, 2000.
    [82] Slingerland N T, Smith A J. Design and characterization of the Berkeley multimedia workload[J]. Multimedia Systems, 2002, 8(4): 315-327.
    [83] Slingerland N T, Smith A J. Measuring the performance of multimedia instruction sets[J]. IEEE Trans, on Computers, 2002, 51(11): 1317-1332.
    [84] Slingerland T, Smith A J. Multimedia instruction sets for general purpose microprocessors. Technical Report, UC Berkley, 2000.
    [85] Slingerland T, Smith A J. Cache performance for multimedia applications[C]. Proceedings of the 15th International Conference on Supercomputing. ACM Press, 2001, 204-217.
    [86] Sreaman N, Govindarajan R. A vectorizing compiler for multimedia extensions[J]. Int. J. Parallel Program., 2002, 28(4): 363-400.
    [87] Stephenson M, Babb J, Amarasinghe S. Bitwidth analysis with application to silicon compilation[C]. ACM SIGPLAN Conf. on Programming Language Design and Implementation. June 2000, 108-120.
    [88] Talla D, John L K, Burger D. Bottlenecks in multimedia processing with SIMD style extensions and architectural enhancements[J]. IEEE Trans. Compute., 2003, 52(8): 1015-1031.
    [89] Thakkar S T, Huff T. Internet streaming SIMD extensions[J]. Computer, 1999, 32(12): 26-34.
    [90] Vajapeyam S, Joseph P J, Mitra T. Dynamic vectorization: A mechanism for exploiting far-flung ILP in ordinary programs[C]. ISCA'99: Proceedings of the 26th Annual International Symposium on Computer Architecure. IEEE Computer Society, 1999,16-27.
    [91] Weihardt M, Luk W. Pipeline vectorization[C]. The 7th Annual IEEE Simposium on Field-Programmable Custom Computing Machines FCCM'99. 1999.
    [92] Zhou J, Ross K A. Implementing database operations using SIMD Instructions[C]. Proceedings of the 2002 ACM S1GMOD International Conference on Management of Data. ACM Press, 2002,145-156.
    [93] Zheng B, Tsai J Y, Zhag B Y, Chen t, Huang B, Li J H, Ding Y H, Liang J, Zhen y, Yew P C, Zhu C Q. Designing the Agassiz comoiler for concurrent multithreaded architecures[C]. 12th International Workshop on Languages and Compilers for Parallel Computing. 1999, 380-398.
    [94] Zhu J, Jiang W. Overflow controlled SIMD arithmetic[C]. Proc.of 17~(th) LCPC, West Lafayette, Indiana, September 2004: Springer-verlag, 424-438.

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700