循环变换技术在自动向量化中的应用研究

英文题名：Research on Application of Loop Transformation for Auto-vectorization
作者：黄磊
论文级别：硕士
学科专业名称：计算机软件与理论
中文关键词：自动向量化 ; 循环分布 ; 循环合并 ; 循环展开 ; 交互式变换
英文关键词：Auto-vectorization ; Loop Distribution ; Loop Fusion ; Loop Unroll ; Interactive Transformation
学位年度：2011
导师：姚远
学科代码：081202
学位授予单位：解放军信息工程大学
论文提交日期：2011-04-15

摘要

近年来,随着多媒体应用的迅速发展,很多高性能微处理器都采用了SIMD扩展技术。SIMD扩展技术缺少统一的指令描述规范,程序员不仅需要对程序的结构有较为深刻的认识,而且要掌握目标平台支持的扩展体系结构和指令集的特点,对程序员来说编写SIMD程序是具有挑战性的工作,因此当前编译技术中的自动向量化就是解决这个问题的一个重要方法。
     自动向量化需要编译器能够充分分析原程序的特点,将其中可向量化的循环或语句块转换成语义等价的SIMD指令。本文针对当前的主流编译器在自动向量化方面的缺点,结合传统的循环变换方法,深入研究了可以提升自动向量化效率的循环变换技术。首先,以数据依赖关系分析为基础,分析了程序中语句向量化的合法性,设计实现了语句向量化识别算法及基于语句向量化识别的循环分布算法。第二,针对SIMD体系结构中标量部件与向量部件可并行工作的特点,研究了基于混合并行的循环变换技术,设计实现了一种简单通用的循环选择合并算法,并针对合并不成功或性能不能达到预期效果的情况,提出了相应的分段展开变换策略,提高了系统的硬件资源利用率。第三,针对自动向量化变换在处理复杂应用程序所存在的不足,研究了基于信息识别的交互式变换方法,设计实现了一种交互式的循环变换调优框架。在该框架结构内,通过分析应用程序的特点,采用可视化动静态编译信息的方法,为用户提供一个高效的渐近式SIMD交互调优环境。
     本文研究的循环变换方法在自动向量化项目SW-VEC中进行实现,测试结果表明,使用本文的自动变换方法对SPEC CPU2000浮点测试集中几个向量化较好的测试程序性能平均有10%左右的提升,从而验证了本文提出的方法可以提升向量化识别率并能够得到较高的加速比。同时,测试结果表明,通过交互式的循环变换方法,可以弥补自动向量化中难以处理的复杂情况,进一步提高编译器的向量化能力。
In recent years, as the rapid development of multimedia applications, SIMD extension technologies have been widely used in many high performance microprocessors. Due to lacks of unitive rules to describe the instructions for SIMD extension technologies, the programmer not only need to profoundly understand the structure of program, but also need to know the characteristics of the structure and instruction sets of corresponding SIMD extension, which bring great challenges to writing SIMD program by hand. So lots of current compilers adopt auto-vectorization to solve this problem.
     In order to realize auto-vectorization, compilers need to completely identify the program’s characteristics and transform the loops or basic blocks to SIMD instructions with equal function. Considering the limit of current auto-vectorization compilers, this thesis makes researches to improve the efficiency of auto-vectorization with methods of loop transformation. Firstly, based on analyzing the data dependence relationship, an algorithm for identifying statement’s vectorization and a loop distribution algorithm based on identifying statement’s vectorization ability is discussed and implemented. Secondly, considering the SIMD assembly and scalar assembly can work together in SIMD computer system, some loop transformation methods for mix-parallel are researched. Then this thesis designs and implements a common loop fusion algorithm. If the algorithm can’t achieve the expect effect, a corresponding loop unroll policy is brought out. Thirdly, in order to help auto-transformation when dealing with the complexity applications, this thesis designs a tune frame for interactive loop transformation after researching the relative interactive transformation methods based on information recognizing. In the frame, an effective interactive transformation environment is provided to users by displaying the dynamic and static compiler information through analyzing the applications.
     These loop transformation methods have been realized in current auto-vectorization project, and the test results show that with our auto-transformation methods, the performance have rised up 10% averagely for some programs which are fit for vectorization in the SPEC CPU2000 float test sets. It’s proved that our transformation methods can improve the vector recognization and execution efficiency. At the same time, it shows that with the interactive transformation method, we can solve some complex instances which can’t be solved by the auto-vectorization compiler, and farther improve the program vectorization ability of the compiler.

引文

[1] The Software Optimization Cookbook:High-Performance Recips for IA-32 Platforms (Second Edition) ISBN 978-7-121-04005-4 [美]Richard Gerber,Kevin B.Simith,[荷]Aart J.C.Bik,[加]Xinmin Tian著.
    [2] Intel Corp. "Intel C/C++ and Intel Fortran Compilers for Linux", Information available at http://www.intel.com/software/products/compilers .
    [3] Free Software Foudation, GCC, http://gcc.gnu.org.
    [4] Pathscale Compiler User’s Guide .2.0 http://www.pathscale.com/.
    [5] Samuel Larsen, Saman Amarasinghe. Exploiting Superword Level Parallelism with Multimedia Instruction Sets. In Proc.Of the ACM SIGPLAN Conference on Programming Language Design and Implementation, Jun 2000, page 145-156.
    [6] Peng Wu, Alexandre E.Eichenberger Amy Wang, Peng Zhao. An Integrated Simdization Framework Using Virtual Vectors. ICS’05, June 20-22 2005, Boston, MA, USA.
    [7] Alexei Kudriavtsev, Peter Kogge. Generation of Permutations for SIMD Processors. Submitted to LCTES’05 Chicago, Illinois, June 15-17 2005.
    [8]李玉祥面向非多媒体程序的SIMD向量化方法及优化技术研究北京中科院计算技术研究所. 2008 [博士学位论文].
    [9] Boekhold M, Karkowski I, Corporaal H. Transforming and parallelizing ANSI C programs using pattern recognition[C]. Lecture Notes in Computer Science. 1999.
    [10] Manniesing R, Karkowski I, Corporaal H. Automatic SIMD parallelization of embedded applications based on pattern recognition[C]. Proceedings of 6th International Euro-Par Conference. 2000: 349-356.
    [11] Dorit Naishlos, IBM Research Lab in Haifa, Autovectorization in GCC .GCC Developers' Summit 2004 105-117.
    [12] G.Cheong and M.S.Lam, An optimizer for multimedia instruction sets. In The Second SUIF Compiler Workshop, Stanford University, USA, August 1997.
    [13] L.Bachega,S.Chatterjee,K.Dockser,J.Gunnels,M.Gupta,F.Gustavson,C.Lapkowski,G.Liu,M.Mendell,C.Wait,T.J.C. Ward. A High-Performance SIMD Floating Point Unit for BlueGene/L:Architecture,Compilation,and Algorithm Design.Parallel Architectue and Compilation Techniques(PACT 2004),Antibes Juan-les-Pins,France,Sept-Oct 2004.
    [14] Randy Allen and Ken Kennedy. Optimizing Compilers for Modern Architectures: A Dependence-based Approach. Morgan Kaufmann, San Francisco, California, 2001.
    [15]陈火旺,刘春林,谭庆平,赵克佳,刘越编著.程序设计语言编译原理(第3版)国防工业出版社2000:369.
    [16]沈志宇,胡子昂,廖湘科,吴海平,赵克佳,卢宇彤著.并行编译方法(第1版)国防工业出版社,ISBN 7-118-02209-8. 2000.7
    [17] Randy Allen, Ken Kennedy, Automatic Loop Interchange. 20 Years of the ACM/SIGPLANConference on Programming Language Design and Implementation (1979-1999): A Selection, 2003.
    [18] Open64, http://open64.sourceforge.net.
    [19] DAVID F. BACON, SUSAN L. GRAHAM, AND OLIVER J. SHARP .Computer Science Division, Unwersbty of California, Berkeley, California 94720. Compiler Transformations for High-Performance Computing. ACM Computing Surveys, Vol. 26, No. 4, December 1994.
    [20] Randy Allen, Ken Kennedy:"Automatic Translation of Fortran Programs to Vector Form".ACM Trans.On Programming Languages and Systems, 1987, 9(4):491-542.
    [21]曾扬.循环分布及依赖关系破除的优化问题[J].计算机学报, 1993年6月, 16(6).
    [22] Kenkennedy, KathrynS.McKinley. Loop Distribution with Arbitrary Control Flow.Rice University Department of Computer Science.
    [23] K.Kennedy and K.McKinley. Maximizing loop parallelism and improving data locality via loop fusion and distribution. In U.Banerjee, D.Gelernter,A.Nicolau, and D.Padua, editors, Languages and Compilers for Parallel Computing, Lecture Notes in Computer Science , Number 768, pages 301-320, Springer-Verlag,Berlin, 1993
    [24] K.Kennedy. Fast greedy weighted fusion. In Proceedings of the 2000 ACM International Conference on Supercomputing, May 2000.
    [25] K.Kennedy. Fast greedy weighted fusion. International Journal of Parallel Processing, 29:5, Octorber 2001.
    [26] Antoine Fraboulet, Karen Kodary, Anne Mignotte. Loop Fusion for Memory Space Optimization. ISSS’01, October 1-3, 2001, Montreal, Quebec, Canada.
    [27] Google Daniel von Dincklage, Amer Diwan Department of Computer Science University of Colorado. Optimizing Programs with Intended Semantics. OOPSLA’09, October 25–29, 2009, Orlando, Florida, USA.
    [28] Daniel von Dincklage, Amer Diwan Department of Computer Science University of Colorado. Explaining Failures of Program Analyses. PLDI 2008 Tucson, Arizona USA.
    [29] Barbara Chapman, Oscar Hernandez, LeiHuang, Tien-hsiung Weng, Zhenying liu, Laksono Adhianto,Yi Wen .Dragon:An Open64-Based Interactive Program Analysis Tool for Large Applications. http://www.cs.uh.edu/~dragon.
    [30] OpenUH. http://www.cs.uh.edu/?openuh.
    [31]陈文光,杨博,王紫瑶,郑丰宙,郑纬民.一个交互式的Fortran77并行化系统[J].软件学报, 1999.12,10(12),pp1259-1267.
    [32] R.E.Tarjian. Depth first search and liner graph algorithms. SIAM Journal of Computing 1(2):146-160,1972
    [33] T.Lengauer and R.E.Tarjian. A fast algorithm for finding dominators in a flowgraph[C]. ACM Transactions on Programming Languages and Systems, 1(1):121-141, July 1979.
    [34] D. E. Maydan, J. L. Hennessy, M. S. Lam. Efficient and exact data dependence analysis [A]. In Proceedings of the SIGPLAN 1991 Programming Language Design and Implementation[C], US, 1991, 1-14.
    [35] M.Wolfe and Chau-Wen Tseng. The Power Test for Data Dependence [J]. IEEE Transactions on Parallel and Distributed Systems, 1992, 3(5): 591-601.
    [36] [美] Kai Hwang著,王鼎兴,沈美明,郑纬民,温冬蝉译.高等计算机系统结构并行性可扩展性可编程性[M].清华大学出版社. 1995.8:441
    [37]白中英杨旭东邝坚主编,并行机体系结构(第二版.网络版)[M].科学出版社.2006.1
    [38] (美) Alfred V.Aho, Monica S.Lam,Ravo Sethi,Jeffrey D.Ullman著,赵建化,郑滔,戴新宇译.编译原理[M]机械工业出版社2009.1
    [39] Samuel Larsen Submitted to the Department of Electrical Engineering and Computer Science on April 14, 2006, in partial fulfillment of the requirements for the degree of Doctor of Philosophy. Compilation Techniques for Short-Vector Instructions.
    [40] Samuel Larsen, Rodric Rabbah, and Saman Amarasinghe. Selective Vectorization for Short-Vector Instructions. MIT-CSAIL-TR-2009-064, December 18, 2009
    [41] Ross Tate, Michael Stepp, Sorin Lerner University of California, San Diego. Generating Compiler Optimizations from Proofs. POPL’10, January 17–23, 2010.
    [42] Sorin Lerner, Todd Millstein, Craig Chambers, Department of Computer Science and Engineering University of Washington. Automatically Proving the Correctness of Compiler Optimizations .PLDI’03, June 9–11, 2003.
    [43] J.R.Allen, L.Kennedy, C.Porterfield, and J.Warren. Conversion of control dependence to data dependence. In Conferrence Record of the Tenth Annual ACM Symposium on the Principles of Programming Languages, January 1983.
    [44] J.Shin, M.Hall, and J.Chame. Supeword-Level Parallelism in the Presence of Control Flow. In Proc. of International Symposium on Code Generation and Optimization (CGO), pages 165-175, March 2005.
    [45] R.Cytron, J.Ferrante, B.K.Rosen, M.Wegman, and F.K.Zadeck. Efficiently computing static single assignment form and control dependence graph. ACM Transactions on Programming Languages and Systems, 13(4):452-490, October 1991.
    [46] Dorit Nuzman, Ira Rosen, Ayal Zaks. Auto-Vectorization of Interleaved Data for SIMD. PLDI’06 June 10-16, 2006, Ottawa, Ontario, Canada.
    [47] Byunhyun Jang, Perhaad Mistry, Dana Schaa, Rodrigo Dominguez, and David Kaeli. Data Transformations Enabling Loop Vectorization on Multithreaded Data Parallel Architectures. PPoPP’10, January 9-14, 2010, Bangalore, India. ACM 978-1-60558-708-0/10/01.
    [48] B. Jang, P. Mistry, D. Schaa, R. Dominguez, and D. Kaeli,“Data Transformations Enabling Loop Vectorization, NUCAR Technical Report,”Nov 2009, http://www.ece.neu.edu/groups/ nucar/publications.html.
    [49] Gang Ren, Peng Wu, David Padua. Optimizing data permutations for SIMD devices. Proceedings of the 2006 ACM SIGPLAN conference on Programming language design and implementation, ACM Press, 2006:118– 131.
    [50] Alexandre E.Eichenberger, Peng Wu, Kevin O’Brien. Vectorization for SIMD Architectures with Alignment Constraints. PLDI’04, June 9-11, 2004, Washington, DC, USA.
    [51] I.Pryanishnikov, A.Krall, and N.Horspool. Pointer Alignment Analysis for Processors with SIMD Instructions. In Proc, of the 5th Workshop on Media and Streaming Processors at Micro’03, pages 50-57, December 2003.
    [52] S. T. Leung and J. Zahorjan,“Optimizing data locality by array restructuring,”University of Washington, Tech. Rep. TR 95-09-01, 1995.
    [53] Kathryn S.Mckinley, Steve Carr, Chau-wen Tseng.Improving Data Locality with Loop Transformations. ACM Transactions on Programming Languages and Systems, 18(4), July 1996.
    [54] Ming Yang, Yuan Yao, Shuai Wei, Yuanyuan Zhang, Lei Huang. A Technology Based Benefit Analysis on Reuse of Vector Register for SIMD Vectorization Optmization[C]. In proceeding of ISISE. SHANGHAI. 2010. 101-104.
    [55] SPEC2000: www.spec.org

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700