Research on Task-Flowing Techniques for Heterogeneous Architectures
Abstract
Heterogeneous architecture is an important direction in current high-performance computing research. Architectural heterogeneity not only provides new ideas for the development of high-performance computing technology and new opportunities for improving system performance, but also confronts researchers with a series of problems such as programming barriers, poor usability, and a high threshold for use.
First, a heterogeneous computing system contains multiple computing components with different structural characteristics. These structural differences mean the components cannot be treated uniformly; each kind must be handled according to its own architectural features. Second, the many kinds of applications (or algorithms) and the computing components of a heterogeneous system are themselves mutually heterogeneous, and no existing computation model or programming paradigm fully meets the needs of all heterogeneous computing systems; the tension between the high-level abstraction required for software development and the transparency of the underlying hardware makes heterogeneous high-performance computing systems difficult to exploit. In addition, although the new computing components now in fashion offer strong computational capability, their development characteristics limit the reuse of existing high-performance computing programs and demand considerable expertise from programmers, and a general, widely accepted parallel programming model for heterogeneous computing systems is still lacking. Therefore, an urgent problem in the development of heterogeneous high-performance computing is how to form an effective mapping between computing components and the tasks of an application, based on the computational characteristics of the components and the execution characteristics of the application, so as to lower the programming threshold and reduce the difficulty of using heterogeneous high-performance computing systems.
To address these problems in the development of heterogeneous high-performance computing systems, the main research content and contributions of this dissertation are as follows:
1. To address the dual heterogeneity between the computing components of a heterogeneous system and the applications that run on it, a task flow model for heterogeneous computing systems is proposed. By abstractly describing the computational characteristics of the heterogeneous computing components and the execution characteristics of applications, and applying a series of analysis and evaluation techniques, the model forms an effective mapping between the heterogeneous components and the tasks into which an application is partitioned, together with a hardware configuration scheme. This improves the precision and effectiveness of acceleration on heterogeneous components, lowers the programming threshold, and reduces the difficulty of using heterogeneous high-performance computing systems.
2. For the two mainstream heterogeneous computing components, the GPU and the FPGA, a parallel cost model is proposed for each. By abstracting the hardware structure of the GPU and the FPGA, the models extract and evaluate the key factors that affect execution performance, P (processing capacity), M (memory capacity), and C (communication capacity), and compute the performance differences of each application task when executed on the heterogeneous components. Experimental results show that both cost models estimate the execution performance of heterogeneous components effectively, with errors no greater than 10%, improving the precision and effectiveness of acceleration on heterogeneous components.
3. Based on the characteristics of the hardware structure model abstracted by the task flow model, a task partitioning method among the nodes of a heterogeneous computing system, built on a static load balancing strategy, is proposed. The method abstracts the application, describes its execution characteristics with a task flow graph, and thereby converts the load balancing problem into a graph partitioning problem. By improving the PMETIS algorithm of the mainstream graph partitioning tools, it achieves low inter-node communication volume and balanced computational load across the partitioned task subsets. Experimental results show that, at a similar degree of load balance, the improved P-PMETIS algorithm is more efficient and partitions in less time.
4. To address the differences among heterogeneous computing components in processing, memory, and communication capacity, a task mapping method for heterogeneous computing components is proposed. Using the performance evaluation results of the components and the attribute weights of tasks in the task flow graph, the method varies the hardware type on which parallelizable tasks execute, computes the critical path length of the whole task flow graph, and selects the graph form with the shortest critical path as the mapping between components and tasks, achieving a reasonable mapping. Experimental results show that overall application performance improves by a factor of 3, while tasks mapped to the GPU and the FPGA achieve speedups of 14x and 43x, respectively.
5. To address the poor reuse of existing high-performance computing programs on heterogeneous systems and the high expertise demanded of programmers, a task-flowing tool for heterogeneous architectures is implemented on top of the task flow model, based on the open-source Open64 compiler. Through task partitioning, flowing, granularity adjustment, and mapping, the tool performs source-to-source transformation of applications. Programming difficulty is reduced by adding compiler directives to the application, and the utilization of heterogeneous components is improved. Experimental results show that tasks mapped to the GPU and the FPGA achieve speedups of 23.43x and 8.86x, respectively; overall application performance improves 4x over serial execution and, on average, 2x over MPI parallel execution. The tool achieves good acceleration and offers a useful reference for porting applications to heterogeneous computing systems.
Heterogeneous architecture is an important direction in the present study of high performance computing. It not only provides a new development path for high performance computing technology and new opportunities for improving high performance computing systems, but also presents researchers with a series of problems, such as programming barriers, poor usability, and high demands on users.
First of all, a heterogeneous computing system includes a variety of computing components with different architectures, so the components cannot be used in a uniform way; each kind must be analyzed according to its own hardware architecture features. Second, the various applications (or algorithms) and the disparate computing components are completely heterogeneous, which means no single computation model or programming mode can currently meet all the needs of heterogeneous computing systems. The high-level abstraction of programming, combined with the transparency of the hardware, makes heterogeneous computing systems extremely difficult to use. In addition, although the popular heterogeneous computing components of today have strong processing capacity, they are limited by their own design characteristics: poor reuse of existing high performance computing programs, high demands on programmers, and the lack of a generally accepted parallel application design mode for heterogeneous architectures. Therefore, forming a reliable mapping between computing components and tasks, based on the features of the disparate components and the execution characteristics of applications, and thereby reducing the difficulty of programming and use, are the urgent problems to be solved in heterogeneous computing research.
To address the problems and challenges faced by high performance computing based on heterogeneous architectures, the main contents and contributions of this dissertation are as follows:
1. To address the double heterogeneity between disparate computing components and different applications, this dissertation puts forward a task flow model for heterogeneous architectures. By abstracting the hardware features of the heterogeneous computing components and the execution characteristics of applications, a reasonable mapping and a logical hardware configuration are obtained through a series of analysis and estimation methods. Accordingly, the precision and effectiveness of using heterogeneous computing components to accelerate application execution are improved, and the difficulty of programming and of using heterogeneous computing systems is reduced.
2. For the diversity of heterogeneous computing components in hardware architecture and implementation mode, and based on an analysis of the GPU and the FPGA, this dissertation puts forward two parallelization cost models: one for the GPU and one for the FPGA. By analyzing the heterogeneous hardware architectures, the models abstract parametric representations of the key factors that affect execution performance, such as processing capacity, memory access capacity, and communication capacity, and estimate application execution performance by accounting for all overheads incurred on the GPU and the FPGA. Experimental results show that the mean error of the GPU cost model does not exceed 10%, and its estimates grow more accurate as the matrix data size increases. The estimates of the FPGA cost model are lower than the actual execution results, with an error rate below 9.8% that also shrinks as the data size grows. The models thus make acceleration with the GPU or the FPGA more accurate and effective.
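As a concrete illustration of the idea behind such a cost model, a minimal sketch can estimate a task's time on a device from its P, M, and C factors. The abstract does not give the dissertation's actual formulas, so the function names, parameters, and the simple additive combination below are purely illustrative assumptions:

```python
# Illustrative sketch of a PMC-style cost estimate (assumed form, not the
# dissertation's actual model): total time is the sum of processing,
# memory-access, and communication components.
def estimate_time(ops, flops_per_sec,                   # P: processing capacity
                  bytes_accessed, mem_bw,               # M: memory access capacity
                  bytes_transferred, link_bw, latency): # C: communication capacity
    t_p = ops / flops_per_sec
    t_m = bytes_accessed / mem_bw
    t_c = latency + bytes_transferred / link_bw
    return t_p + t_m + t_c

def relative_error(predicted, measured):
    """Error metric of the kind used to validate such models (e.g. < 10%)."""
    return abs(predicted - measured) / measured

# e.g. 1 GFLOP of work on a 10 GFLOP/s device, 100 MB of memory traffic
# at 1 GB/s, and 1 MB over a 100 MB/s link with 10 ms setup latency:
t = estimate_time(1e9, 1e10, 1e8, 1e9, 1e6, 1e8, 0.01)
```

A real model of this kind would also have to account for overlap between computation and data movement, which the purely additive form above ignores.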
3. Based on the properties of the heterogeneous computing system architecture abstracted by the task flow model, this dissertation presents a task partitioning method among heterogeneous computing system nodes, built on a static load balancing strategy. The method represents the execution characteristics of an application as a task flow graph, so that a graph partitioning algorithm can solve the load balancing problem among the nodes. A P-PMETIS algorithm is proposed by improving the PMETIS algorithm of the graph partitioning tools; it reduces inter-node communication while keeping the computational volume of the partitioned task subsets balanced. Experimental results show that, at a similar degree of load balance, P-PMETIS is more efficient and runs in less time than PMETIS.
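The reduction of load balancing to graph partitioning can be illustrated with a toy heuristic. The sketch below uses a greedy largest-first assignment, not the multilevel P-PMETIS algorithm itself; vertex weights stand for task computation volumes and edges for communication between tasks:

```python
# Toy illustration of balanced graph partitioning (greedy heuristic,
# not the dissertation's P-PMETIS algorithm).
def greedy_partition(weights, k):
    """Assign the heaviest remaining vertex to the currently lightest part."""
    parts = [set() for _ in range(k)]
    loads = [0.0] * k
    for v in sorted(range(len(weights)), key=lambda v: -weights[v]):
        i = min(range(k), key=lambda i: loads[i])  # lightest part so far
        parts[i].add(v)
        loads[i] += weights[v]
    return parts, loads

def edge_cut(edges, parts):
    """Count edges whose endpoints land in different parts (communication)."""
    side = {v: i for i, p in enumerate(parts) for v in p}
    return sum(1 for u, v in edges if side[u] != side[v])

# Four tasks with computation volumes 4, 3, 3, 2 split across 2 nodes:
parts, loads = greedy_partition([4, 3, 3, 2], 2)
```

Real partitioners optimize the edge cut and the balance jointly (via coarsening and refinement), whereas this greedy pass only balances the loads and measures the cut afterwards.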
4. For the diversity of heterogeneous computing components in processing capacity, memory access capacity, and communication capacity, a task mapping method for heterogeneous architectures is proposed. Using the results of component performance evaluation and the attribute weights of the vertices of the task flow graph, the method selects as the mapping scheme the task flow graph whose critical path is shortest, by varying the mapping between components and parallelizable tasks; a reasonable mapping is thereby achieved. Experimental results show that tasks mapped to the FPGA reach a speedup of 43 times, tasks mapped to the GPU reach 14 times, and the entire application reaches 3 times.
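The selection of the shortest-critical-path mapping can be sketched as a brute-force search over device assignments. This is only a minimal illustration: the task names, per-device costs, and exhaustive enumeration below are assumptions for the example, not the dissertation's actual search procedure:

```python
# Sketch: pick the device assignment that minimizes the task flow graph's
# critical path (exhaustive search; a real tool would prune this space).
from itertools import product

def critical_path(tasks, deps, cost):
    """Length of the longest path; `tasks` must be in topological order."""
    finish = {}
    for t in tasks:
        start = max((finish[d] for d in deps.get(t, ())), default=0.0)
        finish[t] = start + cost[t]
    return max(finish.values())

def best_mapping(tasks, deps, dev_cost, devices):
    """Try every task-to-device assignment; keep the shortest critical path."""
    best = None
    for assign in product(devices, repeat=len(tasks)):
        cost = {t: dev_cost[(t, d)] for t, d in zip(tasks, assign)}
        length = critical_path(tasks, deps, cost)
        if best is None or length < best[0]:
            best = (length, dict(zip(tasks, assign)))
    return best

# Two independent tasks 'a' and 'b' feeding 'c'; hypothetical costs per device:
tasks = ['a', 'b', 'c']
deps = {'c': ('a', 'b')}
dev_cost = {('a', 'cpu'): 5, ('a', 'gpu'): 1,
            ('b', 'cpu'): 2, ('b', 'gpu'): 4,
            ('c', 'cpu'): 3, ('c', 'gpu'): 3}
length, mapping = best_mapping(tasks, deps, dev_cost, ('cpu', 'gpu'))
```

In the example, offloading 'a' to the GPU while leaving 'b' on the CPU shortens the critical path to 5, because the two predecessors of 'c' then finish nearly together.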
5. To address the poor reuse of traditional high performance programs and the high demands placed on programmers, a task-flowing tool is proposed based on the Open64 compiler. Source-to-source transformation of applications is achieved through task partitioning, task flowing, granularity adjustment, and task mapping. In this way, programming difficulty is reduced, and the efficiency of using heterogeneous computing components is improved by adding the appropriate compiler directives. Experimental results show that the task-flowing tool generates a hardware-software mapping based on the PMC features of the heterogeneous computing components and the PMC characteristics of the application during execution; application execution performance is improved, and the tool can guide program rewriting.
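The directive-based style can be illustrated with a toy scanner. The pragma name `taskflow map(...)` below is invented for the example, since the abstract does not give the tool's actual directive syntax, and the scanner is written in Python rather than as part of the Open64-based tool:

```python
# Toy illustration of directive-driven mapping: scan C-like source for a
# hypothetical "#pragma taskflow map(<device>)" placed on the line before
# the statement to be offloaded. The real tool's syntax is not specified
# in the abstract; this only shows the annotation style.
import re

DIRECTIVE = re.compile(r'#pragma\s+taskflow\s+map\((\w+)\)')

def collect_mappings(source):
    """Map each directive's following statement to its target device."""
    lines = source.splitlines()
    mappings = {}
    for i, line in enumerate(lines):
        m = DIRECTIVE.search(line)
        if m and i + 1 < len(lines):
            mappings[lines[i + 1].strip()] = m.group(1)
    return mappings

annotated = """\
#pragma taskflow map(gpu)
for (i = 0; i < n; i++) c[i] = a[i] * b[i];
#pragma taskflow map(fpga)
fir_filter(x, y, taps);
"""
```

A source-to-source tool built in this spirit would use such mappings to replace the annotated statements with device-specific kernel launches while leaving the rest of the program untouched.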