面向CFD的交互式并行化技术研究

英文题名：Research on Interactive Parallelization for CFD
作者：丁晓宁
论文级别：博士
学科专业名称：计算机应用
中文关键词：并行处理 ; 自动并行化 ; 计算流体力学 ; 交互式 ; 相关分析 ; 区域计算模型 ; 区域相关 ; 通信令牌 ; 对象树
英文关键词：Parallel Computing ; Automatic Parallelization ; CFD ; Interactive ; Dependency Analysis ; Domain-Computing Model ; Domain Dependency ; Communication Token ; Object Tree
学位年度：2002
导师：朱怡安
学科代码：081203
学位授予单位：西北工业大学
论文提交日期：2001-10-01

摘要

程序自动并行化技术一直是并行处理领域的研究热点与难点，目前虽然已取得了长足进步，但实际应用效果还不理想。我们以计算流体力学CFD为应用领域对程序自动并行化技术进行了长期研究，针对银河-Ⅲ巨型机开发的程序自动并行化系统NPUPAR取得了较好的应用效果。在此基础上，国防科技大学并行与分布处理国家重点实验室建议我们扩充交互功能，进一步提高并行效率。本论文就是围绕国防重点实验室基金项目《面向计算流体力学的交互式并行化技术研究》(编号99JS94．6．1．HK0313)展开的。
     本文的主要工作和贡献如下：
     1、提出并建立了CFD程序并行化的区域计算模型。同传统的程序模型以及CFD程序的帧迭代模型和场循环模型相比，区域计算模型能够更好地反映CFD程序的结构特点，使得对CFD程序的深入分析成为可能。不但有利于保证并行程序正确性，而且有利于开发并行性，减少通信与同步，提高并行程序效率。
     2、在传统相关性理论的基础上提出了区域相关(包括与循环无关的区域相关和跨循环的区域相关)的概念，研究了区域相关的测试算法。区域相关以区域操作为基本单位。区域操作本身所具有的对大块数据进行整体操作的特点，使得区域相关非常适合开发CFD程序中蕴含的数据并行性；同时，区域操作本身的灵活性又使自动并行化系统不但可以在全局范围内进行整体分析，而且可以深入到帧迭代和场循环内部进行较为细致地分析，进而生成更高质量的并行程序。
     3、[Wolfe96]提出了使用FUD链识别分析递归变量的方法。本论文在区域和区域相关概念的基础上根据CFD程序的特点对FUD链予以扩充，提出了识别分析归约和归约变量的EFUD方法。使用EFUD方法不但可以识别复杂的标量归约，而且可以识别数组归约。
     4、研究了基于区域相关的通信判定方法，给出了判定通信的条件以及确定通信数据范围(通信区域)的公式，并以此为基础提出了基于通信令牌的优化策略。通信令牌是包含通信区域、通信的方向等信息的多元组。使用通信令牌便于通过交互进行手工优化。基于通信令牌的优化策略通过合并通信令牌降低通信量，减少通信的次数；通过寻找通信令牌最多的语句对象确定通信语句的位置以降低同步次数。

    摘要
     5、为了提高系统交互效率，提出了程序对象树结构。使用程序对象树结构
    组织并行化系统的内部数据和程序代码，便于交互时信息的定位与综合，并且有
    利于实现增量分析。
     本论文成功地实现了交互式并行化系统Paractive。经国防科技大学并行与分
    布处理国家重点实验室提供的CFD实际算例测试，Paractive生成的并行程序在
    银河－Ill并行计算机上4机并行效率达到80％，8机并行效率达到70％。q
As a hot research topic in parallel computing area, automatic parallelization is attracting more and more researchers. Much progress in exploiting coarse-grain parallelism has been made in recent years, but application results are still disappointing, with many programs achieving little or no speedup while executed in parallel. Our team has accumulated much experience after several years of intensive research on the automatic parallelization for CFD(Computational Fluid Dynamics). The automatic parallelizer NPUPAR we developed for YH-III super computer has shown promising result. Considering our achievements, National Laboratory of Parallel and Distributed Processing(PDL) at National University of Defense Technology recommended that we carry out the research on interactive parallelization to get better preference. This paper is just supported by the Defense Science Foundation to carry out the research on the interactive parallelization for CFD.
    The following is the achievements of the paper:
    1. The paper proposes a Domain-Computing Model for the parallelization of CFD programs. This model fits structural features of CFD programs better comparing with existing ones, including Frame-Iteration Model and Field-Loop Model that we proposed earlier. The Model enables the in-depth analysis, which helps to ensure the correctness and to improve the efficiency of the resultant parallel programs.
    2. The paper introduces domain dependency for domain operations. Domain operation is a kind of uniform operation over a data set, which contains data parallelism. This character of domain operation makes domain dependency suitable to the parallelization of CFD programs. Meanwhile, taking advantage of the flexibility of domain operations, parallelizer can carry out not only global analysis on frame-iteration scale and field-loop scale, but also in-depth and detail analysis on various smaller scales. This helps the parallelizer to generate more efficient parallel programs.
    3. [Wolfe96] proposed a method that recognize induction variables using FUD chains. This paper presents a new method, called EFUD method, to recognize
    HI


    reduction and reduction variables. This method reconstructs FUD chains using the concepts of domain and domain dependency. The new graphs after reconstruction is called EFUD chain. EFUD method is just tailored to the character of CFD program, whose dependencies are generally uniform. Even array reduction, as well as complex scalar reduction, can be recognized by this method.
    4. The paper studies the way to find communications that should appear in **
    parallel programs based on domain dependency. Conditions of communication and formulae for computing the region of those data that require communication are given.
    Then the paper present a strategy to optimize communication using communication tokens. Communication token is a tuple including communication domain, communication direction, and so on. Here, communication domain is the term to represent above data area. Communications can be optimized manually in interactive mode more easily. The strategy reduces the amount and times of communication by merging the tokens, and minimizes the number of synchronizations by placing communications beside the statement which possesses most tokens.
    5. In order to make interaction more productive, the paper constructs a tree framework to organize data used by parallelizer and corresponding analyzing programs. We call this kind of framework as object tree. Object tree simplifies the procedures of data location, integration, and conversion. Moreover, the tree framework facilitates incremental analysis in order to shorten parallelizer 's responding time.
    We developed a parallelizer Paractive to prove our thoughts, and tested Paractive using CFD programs from PDL. Parallel programs obtained can achieve 80% when executed on 4 computing nodes, and 70% on 8 nodes of YH-III super computer.

引文

[Adve95] V.S. Adve, J. Mellor-Crummey, et al. "An Integrated Compilation and Performance Analysis Environment for Data Parallel Programs." InProceedings of Supercomputing 1995, San Diego, CA,November 1995.
    [Allen81] F.E.Allen, J. Cocke, and K.Kennedy (1981). Reduction of Operator Strength. In Program Flow Analysis: Theory and Applications, Steven S. Muchnick and Neil D. Jones, eds. Englewood Cliffs, N.J.: Prentice-Hall, 79--101.
    [Allen84] J.R. Allen, K. Kenndy. "PFC: A Program to Convert Fortran to Parallel Form, Supercomputers: Design and Applications." IEEE Computer Society Press, 1984, pp 186-205.
    [Ammar90] Z.Ammarguellat, W. Harrison, "Automatic recognition of induction variables and recurrence relations by abstract interpretation." In Proceedings Of the SIGPLAN '90 Conference on Programming Language Design and Implementation(PLDI). White Plains, NY, Jun 1990.
    [Amar96] S.P. Amarasinghe, J. M. Anderson, et al. "Multiprocessors from a software perspective." IEEE Micro, 16(3), pages 52-61, June 1996.
    [Apr] Applied Parallel Research. Documentation for FORGE Programming Tools. http://www.apri.com/document.html.
    [ASCI01] ASCI Home. http://www.dp.doe.gov/asc/home.htm. Sep 2001
    [Ball90] A.R. Ballance, A.B. Maccabe, et al. "The program dependence web: a representation supporting control-, data-, and demand-driven interpretation of imperative languages." ACM SIGPLAN '90 Conference on Programming Language Design and Implementation. pp:257—271. Jun 1990.
    [Bane88] U.Banerjee. "Dependence analysis for supercomputing." Norwell. Mass. Kluwer Academic Publishers.
    [Bane93] U. Banerjee, R. Eigenmann, et al. "Automatic program parallelization." In Proceedings of the IEEE, Volume 81(2). pages 211-243, February 1993.
    [Blume92] W.Blume and R.Eigenmann. "Performance Analysis of Parallelizing Compilers on the Perfect Benchmarks Programs,".IEEETrans. Parallei and Distributed Systems,3(6).pp.643-656,Nov. 1992.
    [Blume94] W.Blume, R.Eigenmann. "The Range Test: A Dependence Test for Symbolic, Non-linear Expressions." Proceedings of Supercomputing '94. pp: 528—537, November 1994.
    [Blume95] W. Blume, R. Eigenmann, et al. "Effectiveautomatic parallelization with Polaris." In International Journal of ParallelProgramming, May 1995.


    [Blume96] W.Blume, R.Doallo,et al. "Parallel Programming With Polaris. "Computer, 29(12), pp.78-82, Dec. 1996.
    [BEF+96] W. Blume, R. Eigenmann, K. Faigin, et al. "Restructuring programs for high-speed computers with Polaris." ICPP Workshop on Challenges for Parallel Processing. Pp 149-162, August 1996.
    [BPRE95] B. Pottenger and R. Eigenmann, "Parallelization in the presence of generalized induction and reduction variables." In Proceedings of the 1995 ACM International Conference on Supercomputing, June 1995.
    [Calv96] C. Calvin, L.Colombet. "Performance evaluation and modelling of collective communications on Cray T3D." Parallel Computing, 1996(22),pp1413:1427
    [Chap93] M.B. Chapman, et al. (1993). "Dynamic data distributions in Vienna Fortran." Supercomputing '93. pp:284-293.
    [Chen99] 陈文光，杨博，王紫瑶等．《一个交互式的FORTRAN 77并行化系统》．软件学报．10(12)．PP：1259-1267．Dec，1999
    [CBT+93] M.B. Chapman, F. Thomas,et cl. "Automatic support for data distribution on distributed memory multi-processor systems." Workshop on Languages and Compilers for Parallel Computing 1993. pp: 184-199.
    [CCIC] the Committee on Computing, Information, and Communications. "National Science and Technology CouncilHigh Performance Computing and Communications: Advancing The Frontiers of Information Technology." http://www. itrd.gov/pubs/blue97/index.html.
    [Chen89] M. Chen, M. Choo, J. Li. "Theory and pragmatics of compiling effiecient parallel code." Technical Report of YALEU/DCS/R_760, Yale University, December 1989
    [Chen98] 陈水福，孙炳楠，唐锦春．《建筑绕流风场的并行化数值模拟》．浙江大学学报．1998(32)，5．pp526-533
    [CLWZ99] 蔡国飙，刘世俭，王慧玉，庄逢甘．《真空羽流场的DSMC并行数值模拟》．航空动力学报．1999(14)，2
    [Coop93] K. Cooper, M. W. Hall, et al. "The ParaScope parallel programming environment." In Proceedings of the IEEE, Volume 81, Issue 2, pages 244—263,February 1993.
    [Cytr93] R. Cytron, F. Jeanne,et cl. "Efficiently Computing Φ-Nodes on the Fly." Workshop on Languages and Compilers for Parallel Computing. Pp:461—476.1993.
    [Dai00] 戴梅萼．《美国ASCI计划的实施》．计算机世界．2000年第49期
    [Dirk98] R. Dirk, V.D. RaVel. 《并行计算机和计算流体力学并行算法》．力学进展，

    28(1),Feb.25,1998.
    [Eige98] R. Eigenmann, J.hoeflinger, D. Padua."On the Automatic Parallelization of the Perfect Benchmark." IEEE Transaction on Parallel and Distributed System, 1998(9), 1 ,pp86-94
    [Feng99] 冯百明．《基于分区的自动并行化程序重构技术研究》．西北工业大学博士论文，1999．
    [Frum98] M.Frumkin, M. Hribar, et al. "A Comparison of Automatic Parallelization Tools/Compilers on the SGI Origin 2000." Proc. SC98,http://www. supercomp.org/sc98/TechPapers/sc98_FullAbstracts/Hribar1 140/INDEX.HTM
    [Fudx93] 傅德薰．《流体力学数值模拟》．国防工业出版社，1993．
    [Geis94] Al Geist, Adam Beguelin, et al. "PVM3 User's Guide And Reference Manual." Ork Ridge National Library. September, 1994
    [GBDJ+94] Al Geist,Adam Beguelin, et al. "PVM: Parallel Virtual Machine A Users' Guide and Tutorial for Networked Parallel Computing." The MIT Press, 1994
    [GuLi95] J.Gu, Z.Li, et al, "Symbolic array dataflow analysis for array privation and program parallelization," Supercomputing, 1995
    [Guo+01] 郭照立，施保昌，王能超．《基于区域分裂的非均匀 Lattice Boltzmann方法》．计算物理．2001．3，18(2)．pp：181-184
    [Hagh92] M.Haghighat, R.Mohammad, et al. "Symbolic Program Analysis and Optimization for Parallelizing Compilers." Workshop on Languages and Compilers for Parallel Computing. Pp538—562.1992
    [Hail93] M.W.HalI, T. J. Harvey, et al. "Experiences using the ParaScope Editor: an interactive parallel programming tool." In Proceedings of the Principles and Practices of Parallel Programming '93, pp. 33-43, May 1993.
    [Hall95] M.W.HalI, S. P. Amarasinghe, et al. "Detecting coarse-grain parallelism using an interprocedural parallelizing compiler." In Proceedings of Supercomputing '95, San Diego, California, December 1995.
    [Hall96] M.W. Hall, J. M. Anderson, et al, "Maximizing muitiprocessor performance with the SUIF compiler." IEEE Computer, 29(12), Dec, 1996, pp. 84-89
    [Hira92] S.Hiranandani, K. Kennedy, C.-W. Tseng. "Compiling Fortran D for MIMD distributed-memory machines." Comm. ACM, 1992(35),8,pp66-80
    [HPCC01] NASA High Performance Computing and communications(HPCC) Program. http://hpcc.arc.nasa.gov/index.htm. Mar,2001
    [HPF97] High Performance Fortran Forum. "High Performance Fortran Language Specification, Version 2.0." ftp://softlib.rice.edu/pub/HPF. January 1997,
    [Hock93] R.W. Hockey. "A Framework for Benchmark Performance Analysis." Computer Benchmarks, Advances in Parallel Computing. Vol 8.pp:65-76. 1993


    [Iero96] C.S. Ierotheou, S. P. Johnson, et al. "Computer Aided Parallelisation Tools (CAPTools)-Conceptual Overview and Performance on the Parallelisation of Structured Mesh Codes," Parallel Computing, vol. 22, pp. 163-195, 1996
    [Jaya97] D.N. Jayasimha, M. E. Hayder, S. K. Pillay. An evaluation of architectural platforms for parallel Navier-Stokes computations. The Journal of Supercomputing, 1997,11,pp41-60
    [Jin+96] 金国华，陈福接．《程序并行化技术与工具》．计算机研究与发展．33(7)，pp：481-491．Jul,1996
    [John+98] S.P. Johnson, P. F.Leggett, C.S.lerotheou, et al. "Computer Aided Parallelisation Tools (CAPTools) User Manual." http://captools.gre.ac.uk. Oct, 1998
    [JRJX00] 《今日巨型》．计算机世界．2000年第45期
    [KaiH92] KaiHwang, 《高等计算机系统体系结构》，王鼎兴等译，清华大学出版社，1992年
    [Kang95] 康继昌．《流场计算程序并行化方法的研究》．航空学报．1995．8，14(8)
    [Kang96] 康继昌，朱怡安等．《现代并行计算机原理》．西北厂业大学出版社．1997年6月
    [Kenn94] K. Kennedy. "Compiler technology for machine independence parallel programming." International Journal Parallel Programming, 1994, 22(1), pp79-96
    [KBCD+74] D.Kuck, P. Budnik, S.C.Chen. E,Davis Jr., et al. "Measurements of Parallelism in Ordinary FORTRAN Programs." Computer,7(1),pp,37-46,Jan, 1974.
    [Kuck72] D.Kuck, Y. Muraoka, et al. "On the number of operations simultaneously executable in Fortran-like programs and their resulting speedup." IEEE Trans. On Computers. December 1972.
    [Koe194] C. Koelbel, D. Loveman, et al. "The high performance fortran handbook." The MIT Press, 1994
    [Kuan99] 况正谦．《程序自动并行化的场循环相关分析与优化技术》．西北工业大学博士论文，1999．
    [Kuck97] Kuck & Associates Incorporated, Documentation for the KAP/Pro Toolset. http ://www. kai.com/kpts/, 1997
    [LiMQ98] 李晓梅，莫则尧，乔香珍．并行计算时间模型研究．计算机工程与科学，1998(20),8,pp18-27
    [LiBi00] 李必信，郑国梁等．《一种分析和理解程序的方法一程序切片》．计算机研究与发展，37(3)，Mar.2000．


    [LinY01] 林奕．《流场计算程序自动分析与流程生成系统》．西北工业大学硕士论文．2001
    [LiC+90] J. Li, M. Chen. "index domain alignment: minimizing cost of crossreferencing between distributed arrays." Frontiers 90: The 3rd symposium on the frontiers of massively parallel computation. Oct, 1990
    [LiC+91] J. Li, M. Chen. "compiling communication-efficient programs for massively parallel machines." IEEE trans, on parallel and distributed systems. 2(3), pp361-376, Jul 1991
    [Liao99] S.W. Liao, A.Diwan, et al. "SUIF Explorer: an Interactive and Interprocedural Parallleizer. "Proc. Seventh ACM SIGPLAN,PPOPP, pp.37-48,May 1999
    [Liao00] S.W. Liao. "Suif explorer: an interactive and interprocedual parallelizer." Ph.D thesis of stanford university. Aug,2000
    [Mayd93] D.E,Maydan, S.P. Amarasinghe. "Array data-flow analysis and its use in array privatization," Principles of Program Language, 1993
    [Mcki94] K.S.McKinley. "Automatic and Interactive Parallelization." PhD thesis of Rice University, 1994.
    [Moon00] Sungdo Moon, Byoungro So, Mary W. Hall. "Evaluating Automatic Parallelization In Suif." IEEE Transaction on Parallel and Distributed Systems. 11(1), January 2000.
    [MPIF95] Message Passing Interface Forum. "MPI: A Message-Passing Interface Standard." June 12, 1995
    [MPIF97] Message Passing Interface Forum. "MPI-2:Extansions to the Message-Passing Interface." July 18, 1997.
    [Padua86] D. Padua, M. Wolfe. "Advanced compiler optimizations for supercomputers." In the Communications of the ACM, Volume 29, Issue 12, ACM. December 1986, pp:1184-1201.
    [Pott95] B. Pottenger, R.Eigenmann. "Idiom recognition in the Polaris parallelizing compiler." Proceedings of Supercomputing'95. ACM Press, 1995.444～448
    [PSRC98] Pacific Sierra Research Corp. http://www, psrv. com/vast.html
    [Pugh92] W. Pugh, D.Wonnacott. "Eliminating false data dependence using the omega test. "Proc. of the SIGPLAN 1992 Programming Language Design and Implementation, 1992,p140-151
    [Roge94] A.Rogers, K.Pingali. "Compiling for distributed memory architectures." IEEE Trans. on Parallel and Distributed Systems. March 5(3).pp281-298,1994
    [Hira92] S. Hiranandani, K. Kennedy, C.W. Tseng. "Compiling Fortran D for MIMD distributed-memory machines." Communications of the ACM, 35(8), pages 66-80, August 1992.


    [Shen97] 沈志宇、廖湘科、胡子昂．并行程序设计．国防科技大学出版社．1997年9月第1版
    [TCR+96] R.Thakar, A. Choudhary, J. Ramanujam. Efficient algorithms for array redistribution. IEEE Transaction on Parallel an Distributed Systems, 1996(7),6,pp587-593
    [Wang98] 王鼎兴，郑纬民，沈美明．并行机群的若干关键技术．清华大学学报(自然科学版)．4(28)．1998．pp15-22
    [Wang99] 王诚，臧斌宇等．并行化编译中递归标量的优化处理．软件学报．10(1)．pp100-106．Jan，1999
    [Wils94] R.P. Wilson, R. S. French, et al. "SUIF: An infrastructure for research on parallelizing and optimizingcompilers." ACM SIGPLAN Notices, 29(12):31-37, December 1994.
    [Wolfe82] M.Wolfe. "Optimizing supercompilers for supercomputers". PhD thesis of Univ. Of Illinois at Urbana-Champaign, October 1982
    [Wolfe87] M.Wolfe, U.Banerjee. "Data dependence and its application to parallel processing. "International Journal of Parallel Programming, 16(2):137-178, April, 1987
    [Wolfe92] M.Wolfe. "Beyond Induction Variables." ACM SIGPLAN '92 Conference on Programming Language Design and Implementation. pp162--174. June, 1992
    [Wolfe96] M.Wolfe. "High Performance Compliers for Parallel Computing". Addison-Wesley Publishing Company. 1996
    [Xiao96] 肖骊．《基于数组共享变量的程序自动并行化研究》，西北工业大学博士论文，1996．
    [Xiao97] 肖骊，康继昌．《流场类问题并行化中数组共享变量的自动搜索》．软件学报．1997(8)，11，pp871-874
    [XK95] Xiao Li, Kang Jichang, et al. "A efficient parallel algorithm of Gauss-Jordan Elimination," International Workshop. on Advanced Parallel Processing Technologies. Beijing, China. pp. 184-188. Sept. 1995.
    [XK96] 肖骊，康继昌．《自动寻找 SPMD 模型公共变量的一种方法》．西北工业大学学报，14(3)：492-494，1996
    [XKK97] 肖骊，况正谦，康继昌．《流场计算程序自动并行化的帧同步和帧通信策略》航空学报．1997(18),5，pp523-526
    [Xuyun97] 徐云，赵宁．《CFD 问题显式有限差分方法的并行化》．南京航空航天大学学报．1997(29),5，pp716-720
    [Yan+00] Guanwu Yan, Li Yuan. "Lattice Boltzmann model for the perfect gas flows with near-vacuum region." Communications in Nonlinear Science&Numerical Simulation. Jun 2000. pp58-63


    [Yan+00] Guangwu Yan, Li Yuan. "Lattice Boltzmann model for the perfectgas flows with near-vacuum regeion." Communications in Nonlinear Science&Numerical Simulation. Jun 2000. pp: 58-63
    [Yang00] 阳雪林，于勐，陈道蓄，谢立．《自动并行编译新技术》．软件学报．2000，11(9)：1268～1275
    [Zhang97] 张博，周兴社，康继昌，谷建华．《面向流场计算的 PVM 并行程序设计研究》．航空学报，1997(18)，3，pp341-344
    [Zhang98] 张宝琳，徐涛，杨烨，乔香珍．《适合在分布存储大规模并行处理系统上应用的块 ADI 算法》．高技术通讯．1998．2
    [Zhang00] 张亚棣．《面向流场计算的大规模并行处理系统硬件平台的研究》．西北工业大学博士学位论文．2000．8
    [Zhu96] 朱传琪，臧斌宇，陈彤．《程序自动并行化系统》．软件学报，1996(7)，3,PP180-185

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700