Research on Automatic SIMD Vectorization Recognition and Code Tuning Technology

Author: Yao Yuan (姚远)
Thesis level: Doctoral
Discipline: Computer Software and Theory
Keywords: SIMD automatic vectorization; continuity analysis; alignment analysis; dependence analysis; control dependence; vector register reuse; interactive performance tuning; feedback tuning; pragmas
Degree year: 2012
Advisor: Zhao Rongcai (赵荣彩)
Discipline code: 081202
Degree-granting institution: PLA Information Engineering University (解放军信息工程大学)
Submission date: 2012-04-15

Abstract

Multimedia applications are typically regular, compute-intensive, and highly parallel. SIMD functional units and SIMD extension instruction sets process the data in such programs in parallel and can substantially speed them up. As multimedia programs have grown more complex, SIMD instruction sets have been greatly extended, and SIMD-based acceleration has moved beyond traditional multimedia workloads into scientific computing, where it is steadily maturing. Although many commercial compilers can now vectorize applications automatically, the vectorized code they generate generally performs poorly, and SIMD auto-vectorization and compiler optimization still face many difficulties: references through structures and pointers, data dependence testing, and the conversion of control dependences all seriously hinder a vectorizing compiler's ability to find vectorizable code. To improve SIMD vectorization recognition and to further exploit and optimize the vector parallelism in applications, we developed SW-VEC, a source-to-source automatic vectorization tool targeting the SIMD multimedia extension of the domestically developed high-performance multi-core processor SW1600. By analyzing, optimizing, and restructuring serial source programs, SW-VEC exposes SIMD data parallelism and generates efficient SIMD source code suited to the architecture of the SW1600. This dissertation studies the key techniques that most affect vector recognition and performance optimization when building an automatic vectorization system, focusing on vectorization pre-optimization, SIMD dependence analysis, and SIMD parallelism exploitation and performance optimization. To meet the needs of the project, it also builds an interaction-based performance tuning framework for SIMD vectorized code and studies the key techniques involved: interface design, acquisition of feedback tuning information, and the design and implementation of compiler directives (pragmas). The main contributions fall into four areas:
     1. For SIMD vectorization pre-optimization, the dissertation discusses methods for analyzing and optimizing memory-access continuity and alignment, and proposes continuity and alignment analysis and optimization methods for structures and pointers. Existing SIMD vectorization methods handle structures and pointers with local or global data reorganization: they define a new storage structure and move the structure or pointer data into new storage, changing the original layout so that the code can be vectorized. This incurs extra space and time overhead and degrades the performance of the generated vectorized code. For structures that are non-contiguous or misaligned, the dissertation therefore proposes structure member reordering, which determines whether the references to structure members occur in isomorphic statements and then reorders and pads the members so that they become contiguous and aligned. For pointers, it tracks and records pointer references to determine their continuity and alignment relationships. Experiments show that structure member reordering and the pointer continuity and alignment analysis and optimization effectively improve both the vectorization recognition rate and the performance of the generated vectorized code. On the core code fragments of the test programs, structure member reordering raises the speedup from the negative speedup (a slowdown) obtained with local data reorganization to more than 300%. For pointer code fragments that the Intel compiler reports as not vectorizable, pointer continuity and alignment analysis enables SIMD vectorization and yields speedups of 7% to 43%.
     2. For SIMD data dependence testing, the dissertation proposes a formal method for deciding SIMD data dependences based on the relationship between the data dependence distance and the vectorization factor. Building on it, the dissertation proposes an algorithm that removes anti-dependences from dependence cycles and implements loop distribution. For control flow inside loops, it proposes a SIMD vectorization method based on the control dependence graph: execution-variable arrays are created from an extended control dependence graph to store the results of conditional expressions. With this method, the assignments to the execution-variable arrays can themselves be vectorized automatically, and the comparisons on the execution-variable arrays are handled by multi-version vectorization. This confines the vectorization-blocking effect of the uncertain execution variables to the smallest possible scope, widens the range of code that can be recognized as vectorizable, and improves program performance. Experiments show that the proposed SIMD dependence analysis and the analysis and transformations for control flow effectively improve the vectorization recognition rate and performance. Compared with the Intel compiler version 11.0, SW-VEC improves the speedup by up to about 35% and by about 21% on average; compared with SW-VEC without the data and control dependence analysis and optimization, the speedup improves by up to about 30% and by about 17% on average.
     3. For SIMD parallelism exploitation and the performance optimization of vectorized code, the dissertation proposes cost-benefit calculation methods for automatic vectorization at two stages, loop transformation and basic-block SLP (superword-level parallelism) exploitation, and uses them to guide the choice among alternative vectorization and optimization schemes at each stage. For the generated vectorized code, it proposes optimizations that achieve vector register reuse through loop interchange and through loop unroll-and-jam. The former identifies data references that are independent of a loop index and uses loop interchange to move vector computations that do not depend on the vectorized loop from the inner loop to an outer loop, eliminating redundant vector loads and computations inside the loop and increasing vector register reuse. The latter compares the distance of each loop-carried dependence with the vectorization factor; when the loop carries flow, output, or input dependences, vector registers can be fully or partially reused. In addition, to exploit parallelism between the SIMD functional units and the scalar units, the dissertation proposes a mixed vector-scalar parallel execution method for loops. Segmented loop unrolling changes how the statements in the loop body execute, dividing them into a vectorized part and a scalar part; when there is no dependence between the two parts, they can execute in parallel on the SIMD vector units and the scalar functional units, which improves the utilization of system resources. Experiments show that vector register reuse combined with cost-benefit analysis improves the performance of vectorized code, and that the segmented loop unrolling algorithm enables mixed vector-scalar parallelism and effectively improves the performance of the generated code, raising the average speedup by about 12%.
     4. The dissertation proposes and builds an interaction-based framework for tuning the performance of vectorized code. The framework combines three parts: a vectorization tuning window interface, static SIMD vectorization recognition together with feedback-directed performance tuning and analysis, and the insertion of vectorization pragmas. Within the tuning window, the framework integrates dynamic feedback-directed tuning with static vectorized code generation and, supported by a well-defined and complete set of vectorization pragmas, effectively improves the performance of vectorized code. Experiments show that interactive tuning improves the performance of some programs in the SPEC CPU2000 test suite: the speedup improves by up to about 50%, and the average speedup is about 10% higher than before tuning.
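The relationship between dependence distance and vectorization factor named in the second contribution can be illustrated with a small simulation. The sketch below is not the formal method of the dissertation. It assumes the common textbook rule that a loop-carried flow dependence of distance d permits a vectorization factor vf only when d is 0 or d >= vf, and checks that rule empirically. The function names (`scalar_loop`, `simd_loop`, `is_safe`) and the loop body `a[i + d] = a[i] + 1` are illustrative choices, not taken from SW-VEC.

```python
def scalar_loop(a, d):
    """Reference semantics: a[i + d] = a[i] + 1, one iteration at a time."""
    a = list(a)
    for i in range(len(a) - d):
        a[i + d] = a[i] + 1
    return a


def simd_loop(a, d, vf):
    """Same loop executed vf iterations at a time.

    Each chunk performs all of its loads before any of its stores,
    which is how a vector load / vector store pair behaves.
    """
    a = list(a)
    n = len(a) - d
    i = 0
    while i + vf <= n:
        lanes = [a[i + k] + 1 for k in range(vf)]  # vector load + add
        for k in range(vf):                        # vector store
            a[i + d + k] = lanes[k]
        i += vf
    for j in range(i, n):                          # scalar epilogue
        a[j + d] = a[j] + 1
    return a


def is_safe(d, vf):
    """Assumed conservative rule: distance 0, or distance at least vf."""
    return d == 0 or d >= vf


if __name__ == "__main__":
    data = [0] * 16
    for d in range(9):
        same = simd_loop(data, d, 4) == scalar_loop(data, d)
        print(f"distance={d} rule_says_safe={is_safe(d, 4)} results_match={same}")
```

On this input with `vf = 4`, distances 1 to 3 make the chunked execution diverge from the scalar loop, while distance 0 and distances of 4 or more give identical results, which agrees with the assumed rule.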
     Finally, the dissertation evaluates the overall SIMD vectorization recognition rate of SW-VEC and the performance of the vectorized code it generates. The results show that SW-VEC achieves a better automatic vectorization recognition rate than the Intel compiler version 11.0 and a speedup about 16% higher. On the industry application test set, the speedup obtained by interactive vectorization tuning is close to that of manually rewritten code, reaching on average more than 90% of the speedup of the hand-rewritten versions, which indicates that the interactive tuning framework is practical.
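The execution-variable arrays of the second contribution resemble the guard arrays of classical if-conversion, where the outcome of a condition is stored per iteration and later used to select results. The sketch below shows that general idea only, under the assumption that a per-lane select is available. The loop body, the function names `branchy` and `if_converted`, and the chunk width `vf` are hypothetical, and the multi-version handling described in the dissertation is not reproduced.

```python
def branchy(a, b):
    """Original loop: the store to out[i] is control dependent on a[i] > 0."""
    out = list(b)
    for i in range(len(a)):
        if a[i] > 0:
            out[i] = 2 * a[i]
    return out


def if_converted(a, b, vf=4):
    """Same loop after if-conversion, processed vf elements at a time."""
    n = len(a)
    # Step 1: evaluate the condition for every iteration into a guard array.
    # This loop contains no branches, so it corresponds to a vector compare.
    guard = [a[i] > 0 for i in range(n)]
    out = list(b)
    # Step 2: compute the candidate value for every lane, then keep either
    # the new value or the old one per lane (a vector "select").
    for i in range(0, n, vf):
        hi = min(i + vf, n)
        computed = [2 * a[k] for k in range(i, hi)]
        out[i:hi] = [new if g else old
                     for new, g, old in zip(computed, guard[i:hi], out[i:hi])]
    return out


if __name__ == "__main__":
    a = [3, -1, 0, 5, -2, 7]
    b = [10, 20, 30, 40, 50, 60]
    print(branchy(a, b))
    print(if_converted(a, b))
```

Both functions return the same list for any input. The difference is that the second one replaces the branch with data flow, so each step maps onto straight-line vector operations.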

引文

[1] Intel Corporation. IA-32Intel@Architecture Software Developer’s Manual[EB/OL]. IntelCorporation.1997. http://developer.Intel.com.
    [2] Richard Gerber,Kevin B.Simith,Aart J.C.Bik,Xinmin Tian, The Software OptimizationCookbook:High-Performance Recips for IA-32Platforms(Second Edition) ISBN978-7-121-04005-4
    [3][美] Kai Hwang著，王鼎兴，沈美明，郑纬民，温冬蝉译.高等计算机系统结构(并行性、可扩展性、可编程性)[M].清华大学出版社.1995.8:441
    [4] Randy Allen and Kennedy著.张兆庆,乔如良,冯晓兵等译.现代体系结构的优化编译器[M].北京:机械工业出版社,2004:4.
    [5]姜伟华,梅超,郭一.一种针对多媒体扩展指令集和实际多媒体程序的自动向量化方法[J].计算机学报,2005,28(8):1254-1266
    [6] Open64. Overview of the open64Compiler Infrastructure[EB/OL].http://open64.sourceforge.net,2006.
    [7] K. Diefendorff, P. K. Dubey et al. Altivec extension to PowerPC accelerates mediaprocessing[J]. IEEE Micro, March-April2000,20(2):85-95
    [8] M. Wolfe and U. Banerjee. Data dependence and its application to parallel processing[J].International Journal of Parallel Processing,1987,16(2):137-178.
    [9] D. R. Wallace. Dependence of multi-dimensional array references[A]. In: Proceedings of theInternational Conference on Supercomputing[C], SaintMalo, France,1988:418-428.
    [10] Peter Westermann, Ludwig Schwoerer, Andre Kaufmann, Applying Data MappingTechniques to Vector DSPs, Journal of Signal Processing Systems, v.57n.1, p.57-72, October2009
    [11] Mark Hampton, Krste Asanovic, Compiling for vector-thread architectures, Proceedings ofthe sixth annual IEEE/ACM international symposium on Code generation and optimization,April05-09,2008, Boston, MA, USA
    [12] D. C. Lin,"Compiler support for predicated execution in superscalar processors," Master'sthesis, Department of ElectricM and Computer Engineering, University of Illinois, Urbana,IL,1992.
    [13]李玉祥.面向非多媒体程序的SIMD向量化方法及优化技术研究[D].合肥:中国科学技术大学博士学位论文,2008
    [14] A. J. C. Bik, M. Girkar, P. M. Grey, and X. Tian. Efficient exploitation of parallelism onPentium III and Pentium4processor-based systems. Intel Technology J., February2001.
    [15] K. B. Smith, A. J. Bik, and X. Tian. Support for the Intel Pentium4Processor withHyper-threading Technology in Intel8.0Compilers. Intel Technology Journal,8(1), pages19--31, February2004.
    [16] Intel Corp."Intel C/C++and Intel Fortran Compilers for Linux",Information available athttp://www.Intel.com/software/products/compilers
    [17] Alex Peleg, Uri Weiser, MMX Technology Extension to the Intel Architecture, IEEE Micro,v.16n.4, p.42-50, August1996
    [18] Intel. Intel sse4,2006.http://download.Intel.com/technology/architecture/new-instructions-paper.pdf
    [19] Carlson D A, Castelino R W, Mueller R O. Multimedia extensions for a550MHZ RISCmicroprocessor[J]. IEEE Journal of Solid-State Circuits,1997,32(11):1618-1624
    [20] Motorola Inc. AltiVec Technology Programming Environments Manual,1998.
    [21] Keith Diefendorff, Pradeep K. Dubey, Ron Hochsprung, Hunter Scales, AltiVec Extensionto PowerPC Accelerates Media Processing, IEEE Micro, v.20n.2, p.85-95, March2000
    [22] Crescent Bay Software Corp. http://www.psrv.com/vast altivec.html,2003
    [23] Portland Group Compiler Technology. Portland group compiler[EB/OL].http://www.pgroup.com/products/workpgi.htm,2003
    [24] Codeplay Software Limited. http://www.codeplay.com/vectorc/features.html.
    [25] Aart J. C. Bik, Milind Girkar, Paul M. Grey, Xinmin Tian. Automatic Intra-RegisterVectorization for the Intel Architecture. International Journal of Parallel Programming,Vol.30, No2,65-98, April2002.
    [26] Alexandre E. Eichenberger, Kathryn O’ Brien, et al. Optimizing Compiler for the CELLProcessor[C]. PACT2005.
    [27] Cheong G, Lam M S. An optimizer for multimedia instruction sets[C]. Proceedings of the2nd SUIF Compiler Workshop, Stanford University.1997
    [28] Krall A, Lelait S. Compilation techniques for multimedia processors[J]. International Journalof Parallel Programming,2000,28(4):347-361
    [29] Sreraman N, Govindarajan R. A vectorizing compiler for multimedia extensions[J]. Int. J.Parallel Program.,2000,28(4):363-400
    [30] Boosting the performance of multimedia applications by using SIMD instructions[C].Proceedings of14th International Conference on Compiler Construction (CC). Edinburgh:Springer-Verlag,2005:59-75
    [31] Jiahua Zhu, HongJiang Zhang, Hui Shi, Binyu Zang, Chuanqi Zhu: Overflow ControlledSIMD Arithmetic. Proc.of17th LCPC, West Lafayette, Indiana, September2004:Springer-verlag.424-438
    [32] Randy Allen,Ken Kennedy, Automatic Translation of Fortran Programs to Vector Form,ACM Trans.On Programming Languages and Systems,1987,9(4):491-542.
    [33] Samuel Larsen, Saman Amarasinghe. Exploiting Superword Level Parallelism withMultimedia Instruction Sets. In Proc.Of the ACM SIGPLAN Conference on ProgrammingLanguage Design and Implementation, Jun2000,page145-156.
    [34] Boekhold M, Karkowski I, Corporaal H. Transforming and parallelizing ANSI C programsusing pattern recognition[C]. Lecture Notes in Computer Science.1999.
    [35] Manniesing R, Karkowski I, Corporaal H. Automatic SIMD parallelization of embeddedapplications based on pattern recognition[C]. proceedings of6th International Euro-ParConference.2000:349-356.
    [36] Samuel Larsen and Saman Amarasinghe. Exploiting Superword Level Parallelism withMultimedia Instruction Sets. In Proceedings of the SIGPLAN '00Conference onProgramming Language Design and Implementation, pages145{156, Vancouver, BC, June2000.
    [37] Rainer Leupers. Code selection for media processors with SIMD instructions[J]. Design,Automation and Test in Europe Conference and Exhibition2000, Proceedings:4-8
    [38]王迪. SIMD编译优化技术研究[D].杭州:浙江大学,2008
    [39] Aart J. C. Bik, Software Vectorization Handbook, The: Applying Intel MultimediaExtensions for Maximum Performance, Intel Press,2004
    [40] IBM XL C/C++and Fortran compilers. http://www-306.ibm.com/software/awdtools/xlcpp/.
    [41] Franz Franchetti, Stefan Kral, Juergen Lorenz, and Christoph W. Ueberhuber. Efficientutilization of SIMD extensions. Proceedings of the IEEE,93(2):409--425,2005.
    [42] Free Software Foudation, GCC, http://gcc.gnu.org.
    [43] Dorit Naishlos, IBM Research Lab in Haifa, Autovectorization in GCC.GCC Developers'Summit2004105-117.
    [44] Pathscale Compiler User’s Guide.2.0http://www.pathscale.com/.
    [45] The Portland Group Compiler Technology. PGI Users Guide: Parallel Fortran, C and C++for Scientists and Engineers,2004.
    [46] L.Bachega，S.Chatterjee，K.Dockser，J.Gunnels，M.Gupta，F.Gustavson，C.Lapkowski，G.Liu，M.Mendell，C.Wait，T.J.C. Ward. A High-Performance SIMD Floating Point Unit forBlueGene/L:Architecture，Compilation，and Algorithm Design．Parallel Architectue andCompilation Techniques(PACT2004)，Antibes Juan-les-Pins，France，Sept-Oct2004.
    [47] G.Cheong and M.S.Lam, An optimizer for multimedia instruction sets. In The Second SUIFCompiler Workshop, Stanford University, USA, August1997.
    [48] Rashindra Manniesing, Ireneusz Karkowski, Henk Corporaal, Automatic SIMDParallelization of Embedded Applications Based on Pattern Recognition, Proceedings fromthe6th International Euro-Par Conference on Parallel Processing, p.349-356, August29-September01,2000
    [49] Weihua Jiang, Chao Mei, Bo Huang, Jianhui Li, Jiahua Zhu, Bingyu Zang, Chuanqi Zhu"Boosting the Performance of Multimedia Applications Using SIMD Instructions" The15thInternational Conference on Compiler Construction. April2005Edinburgh, Scotland
    [50] M.Wolfe and Chau-Wen Tseng. The Power Test for Data Dependence[J]. IEEE Transactionson Parallel and Distributed Systems,1992,3(5):591-601
    [51] Jaewook Shin, Mary Hall, Jacqueline Chame. Superword-Level Parallelism in the Presenceof Control Flow[C]. CGO2005
    [52]张宏江,臧斌宇,朱传琪.多媒体程序中消除控制相关的技术研究.计算机工程与科学,Vo1.28,No.11,2006
    [53]吴圣宁,李思昆.多媒体处理器的SIMD代码生成.计算机科学,2007.
    [54] Samuel Larsen. Compilation Techniques for Short-Vector Instructions[D].MASSACHUSETTS INSTITUTE OF TECHNOLOGY,2006
    [55] Ren G, Wu P,, Padua D. A preliminary study on the vectorization of multimediaapplications[C].16th International Workshop of Languages and Compilers for ParallelComputing.2003
    [56] Alexandre E. Eichenberger, Peng Wu, and Kevin O'Brien. Vectorization for SIMDArchitectures with Alignment Constraints. In Proceedings of the SIGPLAN '04Conferenceon Programming Language Design and Implementation, pages82{93, Washington,DC, June2004.
    [57] Franchetti F, Puschel M. A SIMD vectorizing compiler for digital signal processingalgorithms[C]. Proceedings of the16th International Symposium on Parallel and DistributedProcessing. IEEE Computer Society,200220.2
    [58] Dorit Nuzman, Ira Rosen, Ayal Zaks, Auto-vectorization of interleaved data for SIMDdevices. Proceedings of the2006ACM SIGPLAN conference on Programming languagedesign and implementation, ACM Press,2006:132-143
    [59] CHANG CHIN-YUNG, CHEN TZUNG-SHI, SHEU JANG PING.Improving MemoryTraffic by Assembly-level Exploitation of Reuses for Vector Registers [J].The Journal ofSupercomputing,2000,17(2):187-204.
    [60] Qian Xinglong, ZANG Binyu, ZHU Chuanqi. Partial Reuse of the VectorRegisters in SIMDOptimization. Computer Engineering&Science, V01.29, No.5,2007.
    [61] Jaewook Shin, Jacqueline Chame, and Mary W. Hall. Compiler-Controlled Caching inSuperword Register Files for Multimedia Extension Architectures. In Proceedings ofInternational Conference on Parallel Architectures and Compilation Techniques, September2002
    [62] Cost model implementation in GCC4interacting with ASM. http://www.hitech-projects.com/euprojects/ACOTES/deliverables/acotes-d4.3-final.pdf.
    [63]Kathryn S. Mckinley, Steve Carr and Chau-Wen Tseng.Improving data locality with looptransformations, ACM Transactions on Programming Languages and Systems,Vol.18,No.4,July1996,Pages424-453.
    [64] ME Wolf. Improving locality and parallelism in nested loops[D]. Standford University,1992.
    [65] Bik A J C, Girkar M, Grey P M, Tian X. Experiments with automatic vectorization for thePentium(R)4processor[C]. Compilers for Parallel Computers.2001
    [66] Bik A J C, Girkar M, Grey P M, Tian X. Efficient exploitation of parallelism on Pentium IIIand Pentium4processor-based systems[C]. Intel Technology2003
    [67]Konrad Trifunovic, Dorit Nuzman, Albert Cohen, Ayal Zaks and Ira Rosen.Polyhedral-Model Guided Loop-Nest Auto-Vectorization. IBM Haifa Research Lab,2009.
    [68]陈文光，杨博，王紫瑶，郑丰宙，郑纬民.一个交互式的Fortran77并行化系统[J].软件学报，1999.12，10(12)，pp1259-1267.
    [69] Barbara Chapman,Oscar Hernandez,LeiHuang,Tien-hsiung Weng, Zhenying liu,LaksonoAdhianto,Yi Wen.Dragon:An Open64-Based Interactive Program Analysis Tool for LargeApplications. http://www.cs.uh.edu/~dragon.
    [70] OpenUH. http://www.cs.uh.edu/openuh.
    [71] Daniel von Dincklage, Amer Diwan Department of Computer Science University ofColorado. Explaining Failures of Program Analyses. PLDI2008Tucson,Arizona USA.
    [72] Google Daniel von Dincklage, Amer Diwan Department of Computer Science Universityof Colorado. Optimizing Programs with Intended Semantics. OOPSLA’09, October25–29,2009, Orlando, Florida, USA
    [73] Samples, Alan Dain. Profile-Driven Compilation[D]. Computer Science Dept Univ. ofCalifornia, Berkeley, Apr.1991.
    [74] Robert G. Burger, R. Kent Dybvig. An infrastructure for profile-driven dynamicrecompilation[C]. ICCL’98.1998
    [75] John Whaley, Christos Kozyrakis. Heuristics for profile-driven method-level speculativeparallelization[C]. Proceedings of the2005International Conference on Parallel Processing.p.147-156. June14-17.2005.
    [76] Tong Chen, Chu-Cheow Lim, et al. Alias and dependence profiling in ORC and theirapplications[R]. MRL Intel.
    [77] Oscar Hernandez, et al. Performance Instrumentation and Compiler Optimizations forMPI/OpenMP Applications[C]. IWOMP2006.2006.
    [78] Georgios Tournavitis et al. towards a holistic approach to auto-parallelization Integratingprofile-driven parallelism detection and machine-learning based mapping[C]. PLDI’09, June15-20,2009,Dublin,Ireland.
    [79]尉红梅,姚建华.并行语言及编译技术现状和发展趋势[J].计算机工程,2004,30(S1):97-98
    [80] Keith Cooper A D, Kennedy K. Vizer:A system to vectorize Intel x86binaries[C].Proceedings of the Third Annukal Symposium of the Los Alamos Computer Science Institute,2002
    [81] Bulic P, Gusin V. An extended ANSI C for Processors with a multimedia extension[J]. Int. J.Parallel Program,2003,31(2):107-136
    [82] Keith D. Cooper, Linda Torczon. Engineering a compiler[D]. Morgan Kaufmann,2002
    [83] C.Ding and K.kennedy.Inter array data regrouping LCPC1999
    [84] Xipeng shen,Yaoqing Gao Lightweight reference affinity ayanisys ICS2005
    [85] U.Kremer Automatic data layout for distributed memory machine. PHD thesis1995
    [86]李玉祥,施慧,陈莉.面向向量化的局部数据重组[J].小型微型计算机系统,2009,30(8):1528-1534
    [87] Intel. Intel Core i7,2008. gttp://www.Intel.com/products/processor/corei7/index.htm.
    [88] Kapasi U J.Efficient conditional operations for data-parallel architectures[C]//Proc.33rdIEEE/ACM Int’lSymp. Microarchitecture. USA: Monterey. IEEE CS Press.2000
    [89] Bjorn Franke and Michael O’boyle. Array recovery and high-level transformations for dspapplications. Trans. On Embedded Computing Sys.,2(2):132–162,2003.
    [90] Zhong Y,Orlovich M,Shen X,et al.Array regrouping and structure splitting usingwholeprogram reference affinity.Procee-dings of PLDI’04.June2004:255-266.
    [91] Hagog M, Tice C.Cache Aware Data Layout Reorganization Optimization inGCC.Proceedings of the GCC Developers’Summit.June2005:69-92.
    [92] M. Hind, M. Burke, P. Carini, and J.-D. Choi. Interprocedural pointer alias analysis. ACMTransactions on Programming Languages and Systems,21(4):848{894,1999.
    [93] I.Pryanishnikov, A.Krall, and N.Horspool. Pointer Alignment Analysis for Processors withSIMD Instructions. In Proc, of the5th Workshop on Media and Streaming Processors atMicro’03, pages50-57, December2003.
    [94] Peng Wu, Alexandre E. Eichenberger. Amy Wang. Efficient SIMD Code Generation forRuntime Alignment and Length Conversion[C]. CGO2005.
    [95] Chilimbi T M, Davidson B, Larus J R.Cache-conscious Structure Definition[C]. Inproceedings of PLDI’99.May1999:13-24.
    [96] Hagog M, Tice C.Cache Aware Data Layout Reorganization Optimization in GCC[C].Inproceedings of the GCC Developers’Summit. June2005:69-92.
    [97] PryanishnikoV，A Krall, Poiter Alignment Analysis for ProcessorS with SlMD lnstmction.InProc0f the5th Workshop of Media streaming Processor,December,page50-57
    [98] Alex Aletà, Josep M. Codina, F. Jesús Sánchez, Antonio González, David R. Kaeli,Exploiting Pseudo-Schedules to Guide Data Dependence Graph Partitioning, Proceedings ofthe2002International Conference on Parallel Architectures and Compilation Techniques,p.281-290, September22-25,2002
    [99] David Kuck. Structure of Computers and Computations[J]. John Wiley&Sons, Inc,1978
    [100] D. E. Maydan, J. L. Hennessy, M. S. Lam. Efficient and exact data dependence analysis[A].In: Proceedings of the ACM SIGPLAN1991Conference on Programming Language Designand Implementation[C], Toronto, Ontario, Canada,1991:1-14.
    [101] Peng Wu, Albert Cohen, Jay Hoeflinger, and David Padua.Monotonic evolution: Analternative to induction variable substitution for dependence analysis. In Proceedings of the15th International Conference on Supercomputing, pages78–91. ACM Press,2001.
    [102] Zima H.P., Chapman B.M. Supercompilers for Parallel and VectorComputers.Addison-Wesley Publishing Company,1990.
    [103] K.Kennedy and K.McKinley. Maximizing loop parallelism and improving data locality vialoop fusion and distribution. In U.Banerjee, D.Gelernter,A.Nicolau, and D.Padua, editors,Languages and Compilers for Parallel Computing, Lecture Notes in Computer Science,Number768, pages301-320, Springer-Verlag,Berlin,1993
    [104] M. Kandemir, I. Kadayif, A. Choudhary, and J. A. Zambreno. Optimizing internest datalocality. In PACT, pages127--135,2002.
    [105] Nancy J. Warter，Scott A. Mahlke. Reverse If-Conversion. PLDI '93Proceedings of theACM SIGPLAN1993conference on Programming language design and implementation，ACM New York, NY, USA
    [106] J.R.Allen, L.Kennedy, C.Porterfield, and J.Warren. Conversion of control dependence todata dependence. In Conferrence Record of the Tenth Annual ACM Symposium on thePrinciples of Programming Languages, January1983.
    [107] Kenkennedy, KathrynS.McKinley. Loop Distribution with Arbitrary Control Flow.RiceUniversity Department of Computer Science
    [108] Armando Solar-Lezama, Rodric Rabbah, Rastislav Bodik, and Kemal Ebcioglu.Programming by sketching for bit-streaming programs. In PLDI’05: Proceedings of the2005ACM SIGPLAN Conference on Programming Language Design and Implementation, pages281–294. ACM Press,2005.
    [109] Markus Puschel, Jose M. F. Moura, Jeremy Johnson, David Padua, Manuela Veloso, BryanW. Singer, Jianxin Xiong, Franz Franchetti, Aca Gacic, Yevgen Voronenko, Kang Chen,Robert W. Johnson, and Nick Rizzolo. SPIRAL: Code generation for DSP transforms.Proceedings of the IEEE,93(2):232–275,2005.
    [110] Dorit Naishlos, Marina Biberstein, Shay Ben-David, Ayal Zaks, Vectorizing for a SIMdDDSP architecture, Proceedings of the2003international conference on Compilers,architecture and synthesis for embedded systems, October30-November01,2003, San Jose,California, USA
    [111] Alan Leung, Ond ej Lhoták, Ghulam Lashari, Automatic parallelization for graphicsprocessing units, Proceedings of the7th International Conference on Principles and Practiceof Programming in Java, August27-28,2009, Calgary, Alberta, Canada
    [112] Daniel S. McFarlin, Volodymyr Arbatov, Franz Franchetti, Markus Püschel, AutomaticSIMD vectorization of fast fourier transforms for the larrabee and AVX instruction sets,Proceedings of the international conference on Supercomputing, May31-June04,2011,Tucson, Arizona, USA
    [113] Hiroaki Tanaka, Yutaka Ota, Nobu Matsumoto, Takuji Hieda, Yoshinori Takeuchi,Masaharu Imai, A new compilation technique for SIMD code generation across basic blockboundaries, Proceedings of the2010Asia and South Pacific Design Automation Conference,January18-21,2010, Taipei, Taiwan
    [114] Asadollah Shahbahrami, Ben Juurlink, Stamatis Vassiliadis, Versatility of extendedsubwords and the matrix register file, ACM Transactions on Architecture and CodeOptimization (TACO), v.5n.1, p.1-30, May2008
    [115] JongSoo Park, Sung-Boem Park, James D. Balfour, David Black-Schaffer, ChristosKozyrakis, William J. Dally, Register pointer architecture for efficient embedded processors,Proceedings of the conference on Design, automation and test in Europe, April16-20,2007,Nice, France
    [116] Randall J. Fisher and Henry G. Dietz. Compiling for simd within a register. In Processingsof11th International Workshop on Languages and Compilers for Parallel Processing, pages290–304,1998.
    [117] Tom Henretty, Kevin Stock, Louis-Noël Pouchet, Franz Franchetti, J. Ramanujam, P. Sadayappan, Data layout transformation for stencil computations on short-vector SIMD architectures, Proceedings of the 20th international conference on Compiler construction: part of the joint European conferences on theory and practice of software, March 26-April 03, 2011, Saarbrücken, Germany
    [118] Matthew Allan Postiff, Trevor Mudge, Compiler and microarchitecture mechanisms for exploiting registers to improve memory performance, 2001
    [119] William Y. Chen, Roger A. Bringmann, Scott A. Mahlke, Richard E. Hank, James E. Sicolo, An efficient architecture for loop based data preloading, Proceedings of the 25th annual international symposium on Microarchitecture, p. 92-101, December 01-04, 1992, Portland, Oregon, United States
    [120] Gautam Doshi, Rakesh Krishnaiyer, Kalyan Muthukumar, Optimizing Software Data Prefetches with Rotating Registers, Proceedings of the 2001 International Conference on Parallel Architectures and Compilation Techniques, p. 257-267, September 08-12, 2001
    [121] Loh, G., “Exploiting Data-Width Locality to Increase Superscalar Execution Bandwidth”, in MICRO-35, 2002.
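    [122] Du Jing. Research on compilation techniques for stream architectures [D]. Changsha: National University of Defense Technology, doctoral dissertation, 2008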
    [123] Farkas, K., Chow, P., Jouppi, N., Vranesic, Z., “The Multicluster Architecture: Reducing Cycle Time Through Partitioning”, in Proc. of MICRO-30, 1997.
    [124] Loh, G., “Exploiting Data-Width Locality to Increase Superscalar Execution Bandwidth”, in MICRO-35, 2002.
    [125] Moudgill, M., Pingali, K., Vassiliadis, S., “Register Renaming and Dynamic Speculation: An Alternative Approach”, in Proc. of MICRO-26, 1993.
    [126] Balakrishnan, S., Sohi, G., “Exploiting Value Locality in Physical Register Files”, in Proc. of MICRO-36, 2003.
    [127] Kim, N., Mudge, T., “Reducing Register Ports Using Delayed Write-Back Queues and Operand Pre-Fetch”, in ICS, 2003.
    [128] Zhang Hongjiang. Research on SIMD compiler optimization techniques for multimedia applications [D]. Shanghai: Fudan University, 2006
    [129] Aart J.C. Bik, Milind Girkar, Paul M. Grey, and Xinmin Tian. Automatic intra-register vectorization for the Intel architecture. International Journal of Parallel Programming, 30(2): 65-98, 2002.
    [130] Randy Allen, Ken Kennedy, Automatic Loop Interchange. 20 Years of the ACM/SIGPLAN Conference on Programming Language Design and Implementation (1979-1999): A Selection, 2003.
    [131] Keith D. Cooper, Ken Kennedy, Nathaniel McIntosh, Cross-Loop Reuse Analysis and Its Application to Cache Optimizations, Proceedings of the 9th International Workshop on Languages and Compilers for Parallel Computing, p. 1-19, August 08-10, 1996
    [132] Gayathri Krishnamurthy, Elana D. Granston, Eric J. Stotzer, Affinity-based cluster assignment for unrolled loops, Proceedings of the 16th international conference on Supercomputing, June 22-26, 2002, New York, New York, USA
    [133] Ming Yang, Yuan Yao, Shuai Wei, Yuanyuan Zhang, Lei Huang. A Technology Based Benefit Analysis on Reuse of Vector Register for SIMD Vectorization Optimization [C]. In Proceedings of ISISE. Shanghai. 2010. 101-104.
    [134] M. E. Wolf. Improving locality and parallelism in nested loops [D]. Stanford University, 1992.
    [135] C. Cașcaval, S. Chatterjee, H. Franke, K. J. Gildea, P. Pattnaik, A taxonomy of accelerator architectures and their programming models, IBM Journal of Research and Development, v. 54 n. 5, p. 473-482, September 2010
    [136] Chunyang Gou, Georgi Kuzmanov, Georgi N. Gaydadjiev, SAMS multi-layout memory: providing multiple views of data to boost SIMD performance, Proceedings of the 24th ACM International Conference on Supercomputing, June 02-04, 2010, Tsukuba, Ibaraki, Japan
    [137] Zheng Wang, Michael F.P. O'Boyle, Partitioning streaming parallelism for multi-cores: a machine learning based approach, Proceedings of the 19th international conference on Parallel architectures and compilation techniques, September 11-15, 2010, Vienna, Austria
    [138] Sylvain Girbal, Nicolas Vasilache, et al. Semi-automatic composition of loop transformations for deep parallelism and memory hierarchies. International Journal of Parallel Programming, 2006.
    [139] Kevin Stock, Louis-Noël Pouchet, P. Sadayappan, Using machine learning to improve automatic vectorization, ACM Transactions on Architecture and Code Optimization (TACO), v. 8 n. 4, p. 1-23, January 2012
    [140] Manuel Hohenauer, Felix Engel, Rainer Leupers, Gerd Ascheid, Heinrich Meyr, A SIMD optimization framework for retargetable compilers, ACM Transactions on Architecture and Code Optimization (TACO), v. 6 n. 1, p. 1-27, March 2009
    [141] Hyesoon Kim, José A. Joao, Onur Mutlu, Yale N. Patt, Profile-assisted Compiler Support for Dynamic Predication in Diverge-Merge Processors, Proceedings of the International Symposium on Code Generation and Optimization, p. 367-378, March 11-14, 2007
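    [142] Gu Lihong, Wu Shaogang, Zhang Longbing, Cai Fei. OpenMP directive extensions for irregular applications [J]. Journal of Chinese Computer Systems (小型微型计算机系统), 2005, 26(1): 124-128
    [143] Intel Asia-Pacific Research and Development Ltd., Beijing Paratera Technology Co. Unleashing Multicore Potential: A Guide to Parallel Development with Intel Parallel Studio [M]. Tsinghua University Press, 2010.9: 12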
