Research on Parallel Compilation Optimization Techniques for Shared Memory Architectures
Abstract
In the development of computer architecture, the emergence and continual advancement of parallel architectures have pushed the peak speed of high performance computer systems to ever new heights. Compared with the hardware's peak performance, however, the sustained performance that user programs actually achieve falls far short, and one major cause is the challenge of parallel programming. Automatic parallelization is an effective route to parallel programming: by analyzing and uncovering the parallelism latent in a sequential program, the compiler automatically generates a parallel program suited to running on a parallel architecture. Automatic parallelizing compilation therefore plays an important role in preserving the existing investment in software and promoting the application of high performance computers.
     Shared memory architectures occupy an important position among high performance computer architectures, and after decades of development, parallelizing compilation for shared memory has become fairly mature. Nevertheless, the automatic generation of efficient parallel code on shared memory platforms still faces several technical challenges, such as: effective parallelization of loops with cross-iteration dependences; accurate estimation of a program's parallel profit during automatic parallelization; and effective use of the multi-level memory system on heterogeneous platforms. Against the background of the development of the parallelizing compiler SW-VEC, this dissertation investigates parallel compilation optimization techniques for shared memory architectures. Its main contributions and innovations are:
     1. An OpenMP-based algorithm for the automatic generation of pipelined parallel code for regular DOACROSS loops, together with a pipelining-granularity optimization algorithm, is proposed; heuristic algorithms for selecting the computation-partitioning level and the loop-tiling level are designed and implemented, effectively improving the automatically parallelized performance of regular DOACROSS loops.
     2. An improved OpenMP-based PS-DSWP automatic parallelization algorithm is proposed. It takes basic blocks rather than instructions as the unit for building the program dependence graph, enlarging the granularity of parallelism, and uses the OpenMP API to assign tasks and share data among threads, effectively extending the applicability of PS-DSWP and improving the performance of the generated code.
     3. A new OpenMP cost model is established. Following a modular, layered strategy, the model is split into two layers, a loop execution model and a hardware model, which both allows it to be extended flexibly and makes it easy to port to and apply on different target architectures.
     4. A data transfer optimization method based on the polyhedral model and a precise array-region representation is proposed. A set of OpenMP extension clauses for controlling data transfers on heterogeneous platforms is designed, and blocked regular array regions together with their union operation are defined to represent array regions precisely, improving the utilization efficiency of the multi-level memory system on heterogeneous platforms.
     The algorithms and models proposed in this dissertation have been implemented and applied in the parallelizing compiler SW-VEC, validating their correctness and efficiency.
The peak speed of high performance computers has been pushed to new heights again and again by the adoption and continual innovation of parallel architectures. However, the difficulty of programming parallel architectures poses a huge challenge to programmers. Among the approaches to the parallel programming problem, one that is very promising but also very challenging is automatic parallelization: the process of automatically converting a sequential program into a version that can run directly on multiple processing elements without altering the program's semantics. Because it requires little effort from programmers, it is very attractive.
     Shared memory architectures have always played an important role in the development of high performance architectures. Automatic parallelization techniques for shared memory architectures have been widely studied for scientific and numerical applications, but many challenges remain in the automatic generation of high performance parallel programs for such architectures, including the parallelization of loops with inter-iteration dependences, cost models for profit estimation, and data transfer optimization for heterogeneous platforms.
     Based on the research and development of the automatic parallelizing compiler SW-VEC, this dissertation focuses on automatic parallelization and optimization technologies for shared memory architectures. Its main contributions and innovations are as follows:
     1. A pipelining-granularity optimization algorithm based on a DOACROSS cost model is proposed to obtain the optimal pipelining granularity, and an OpenMP-based automatic pipelining parallelization algorithm for regular DOACROSS loops is proposed for shared memory platforms. Together, these algorithms enable the automatic generation of effective pipelined code.
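As an illustration of the pipelining scheme, the following C sketch runs a regular DOACROSS wavefront recurrence in pipelined fashion under OpenMP. The tile width `B`, the flag-based post/wait synchronization, and the cyclic row schedule are simplifying assumptions for exposition, not the SW-VEC code generator's actual output.

```c
/* Pipelined execution of a regular DOACROSS loop (sketch).  The
 * recurrence a[i][j] = a[i-1][j] + a[i][j-1] carries a dependence
 * across rows, so rows are assigned to threads and synchronized
 * tile by tile; B is the pipelining granularity the thesis tunes. */
#include <assert.h>

#define N 8
#define B 2                      /* pipelining granularity: columns per tile */

static int a[N][N];
static volatile int done[N];     /* done[i] = tiles of row i completed */

void wavefront(void)
{
    for (int i = 0; i < N; i++)
        done[i] = 0;

    /* schedule(static,1) gives each thread a cyclic set of rows; a
     * thread may start tile t of row i once row i-1 finished tile t. */
    #pragma omp parallel for schedule(static, 1)
    for (int i = 0; i < N; i++) {
        for (int t = 0; t < N / B; t++) {
            if (i > 0)
                while (done[i - 1] <= t)   /* wait; a real implementation */
                    ;                      /* would add omp flush here    */
            for (int j = t * B; j < (t + 1) * B; j++)
                a[i][j] = (i == 0 || j == 0) ? 1
                        : a[i - 1][j] + a[i][j - 1];
            done[i] = t + 1;               /* post: tile t of row i ready */
        }
    }
}
```

Shrinking `B` fills the pipeline sooner but pays synchronization more often; growing `B` amortizes synchronization at the cost of startup delay. Choosing this granularity is exactly what the cost-model-driven optimization automates.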
     2. An improved OpenMP-based PS-DSWP algorithm is proposed that is implemented independently of any particular CPU architecture by operating on a high-level intermediate representation. The program dependence graph (PDG) used by the algorithm is built from basic blocks, which exploits coarser-grained parallelism than the original PS-DSWP transformation, whose PDG is built from instructions. OpenMP is employed to assign tasks and synchronize threads while avoiding platform-specific limitations.
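The pipeline that PS-DSWP extracts can be pictured with a small C sketch, using OpenMP `sections` and a single-producer queue as stand-ins for the generated code. The list traversal, the bounded queue, and the sentinel protocol are illustrative assumptions: a production version would need proper memory flushes, and the parallel stage could be replicated across several workers.

```c
/* PS-DSWP sketch: the loop body is split at basic-block granularity
 * into a sequential stage (pointer chasing, which carries the
 * cross-iteration dependence) and a parallel stage (independent
 * per-element work), decoupled through a queue. */
#include <stdlib.h>

typedef struct node { int val; struct node *next; } node;

node *make_list(const int *vals, int n)      /* demonstration helper */
{
    node *head = NULL;
    while (n-- > 0) {                        /* build front-to-back */
        node *p = malloc(sizeof *p);
        p->val = vals[n];
        p->next = head;
        head = p;
    }
    return head;
}

#define QCAP 1024
static int queue_[QCAP];                 /* single-producer decoupling queue */
static volatile int produced_;

long sum_of_squares(node *head)          /* values must be non-negative */
{
    long sum = 0;
    produced_ = 0;
    #pragma omp parallel sections
    {
        #pragma omp section
        {   /* sequential stage: the dependence-carrying traversal */
            for (node *p = head; p; p = p->next)
                queue_[produced_++] = p->val;
            queue_[produced_++] = -1;    /* sentinel: end of stream */
        }
        #pragma omp section
        {   /* parallel stage: dependence-free work on queued items */
            for (int k = 0; ; k++) {
                while (produced_ <= k)   /* wait for the producer */
                    ;
                int v = queue_[k];
                if (v < 0) break;
                sum += v * (long)v;
            }
        }
    }
    return sum;
}
```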
     3. A compile-time cost model for estimating the profit of automatic parallelization is built following a modular, layered strategy that partitions the model into two layers: a loop execution model and a hardware model. The major benefit of this strategy is that the model is easy to design and implement, and the result is flexible and extensible.
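A minimal sketch of such a two-layer model follows; the parameter names and the linear overhead formula are assumptions chosen for illustration, not the thesis's actual equations. The point is the separation of concerns: retargeting the model to a new machine means remeasuring only the hardware layer.

```c
/* Two-layer compile-time cost model (sketch). */

/* hardware layer: platform constants, remeasured per target machine */
typedef struct {
    double cycle_ns;        /* cycle time */
    double fork_join_ns;    /* cost of entering/leaving a parallel region */
    double per_thread_ns;   /* per-thread scheduling/synchronization cost */
} hw_model;

/* loop execution layer: per-loop properties from compiler analysis */
typedef struct {
    long   iterations;
    double body_cycles;     /* estimated cycles per iteration */
} loop_model;

double serial_time_ns(const loop_model *lp, const hw_model *hw)
{
    return lp->iterations * lp->body_cycles * hw->cycle_ns;
}

double parallel_time_ns(const loop_model *lp, const hw_model *hw, int threads)
{
    /* assumed linear overhead model: ideal speedup plus fixed fork/join
     * cost plus a per-thread management cost */
    return serial_time_ns(lp, hw) / threads
         + hw->fork_join_ns + hw->per_thread_ns * threads;
}

/* the parallelizer keeps a loop parallel only when the model predicts a win */
int profitable(const loop_model *lp, const hw_model *hw, int threads)
{
    return parallel_time_ns(lp, hw, threads) < serial_time_ns(lp, hw);
}
```

A large loop amortizes the parallel overhead and is kept parallel; a tiny loop is serialized because the fork/join cost dominates.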
     4. An approach to managing data storage and data transfer between main memory and local memories is proposed in the form of a potential extension to OpenMP. Blocked regular array regions and their union operation are defined to describe the set of array data to be transferred, and a method based on the polyhedral model is developed to compute these array regions.
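The idea behind precise array regions can be sketched in one dimension: a region merges with another only when the union is still exactly representable, so the transfer set never grows beyond the data actually accessed. The struct layout below is a hypothetical simplification of the multi-dimensional blocked regular regions described above.

```c
/* One-dimensional sketch of precise array regions for transfer sets. */
#include <stddef.h>

typedef struct { long lb, ub; } region;     /* elements [lb, ub], inclusive */

/* union: succeeds iff the regions overlap or touch, so no elements
 * outside the accessed data sneak into the transfer set */
int region_union(region x, region y, region *out)
{
    if (x.ub + 1 < y.lb || y.ub + 1 < x.lb)
        return 0;                            /* disjoint: keep them separate */
    out->lb = x.lb < y.lb ? x.lb : y.lb;
    out->ub = x.ub > y.ub ? x.ub : y.ub;
    return 1;
}

/* bytes a region costs on the transfer link, used when weighing a merge */
size_t region_bytes(region r, size_t elem_size)
{
    return (size_t)(r.ub - r.lb + 1) * elem_size;
}
```

Keeping the representation exact is what lets the compiler transfer only the accessed data between main memory and the local memories of a heterogeneous platform.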
     The algorithms and cost model presented in this dissertation have been implemented and applied in the SW-VEC system, and their validity has been demonstrated.
