Research on the Relationship between Computation Time and Storage Space in Parallel Computing
Abstract
As an essential means of solving large-scale computational problems, high-performance computing (HPC) is being applied ever more widely across science and engineering, and the demands on its efficiency keep growing. Faced with massive, complex, and highly time-critical computing tasks, optimizing parallel program design to improve system performance remains a key and difficult open problem in HPC. Solving it requires, first of all, solving the problem of performance evaluation in high-performance computing.
     Designing and optimizing a parallel program is a highly complex process. Time requirements and storage requirements are essential considerations throughout development, and properly managing the relationship between computation time and storage space is, in turn, an effective route to performance optimization. In the context of the project "Research and Parallel Implementation of Exact Numerical Computation of Aircraft RCS (Radar Cross-Section)", this dissertation centers on the relationship between computation time and storage space in parallel computing, studying evaluation metrics for time and space, the relationships between the two, the time-overhead models of parallel programs, and methods for calculating processor scale. The main contributions are as follows:
     1. A time speedup model and a space speedup model are proposed.
     Based on the characteristics of parallel programs, the classical speedup laws are adjusted into what is here called the time speedup model. The model establishes the existence of time speedup in parallel computing and incorporates a spatial factor into the speedup expression. It also analyzes how time efficiency and computation time change after a parallel program is optimized.
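The dissertation's adjusted time speedup model is not reproduced in this abstract. As background only, the classical fixed-size (Amdahl) speedup and the time efficiency derived from it can be sketched as follows; the serial fraction `f` and processor count `p` are illustrative values, not figures from the dissertation:

```python
def amdahl_speedup(f: float, p: int) -> float:
    """Classical fixed-size speedup for a program with serial fraction f
    running on p processors."""
    return 1.0 / (f + (1.0 - f) / p)

def time_efficiency(f: float, p: int) -> float:
    """Time efficiency: achieved speedup divided by the processor count."""
    return amdahl_speedup(f, p) / p

# With a 5% serial fraction, speedup saturates well below p as p grows,
# so time efficiency falls as processors are added.
print(round(amdahl_speedup(0.05, 16), 2))
print(round(time_efficiency(0.05, 16), 2))
```

The point the abstract makes is that such time-only formulas ignore storage; the proposed model adds a spatial factor, which the sketch above deliberately omits.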
     The way storage requirements change in parallel computing is analyzed, a space speedup model is proposed, and the basic characteristics of storage space in parallel computing are identified. To obtain the space parameters the model requires, two space-accounting strategies are proposed: one measures the peak total storage requirement of a running parallel program, and the other measures the peak storage requirement within a single node.
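The two accounting strategies can be illustrated with a minimal sketch over hypothetical per-node memory samples (the node names, sample values, and synchronized-sampling assumption are all invented for illustration):

```python
def node_peaks(samples: dict) -> dict:
    """Strategy 2: peak storage requirement observed within each node."""
    return {node: max(series) for node, series in samples.items()}

def total_peak(samples: dict) -> float:
    """Strategy 1: peak of the program's *total* storage requirement,
    i.e. the maximum over time of the sum across nodes (all series are
    assumed to be sampled at the same instants)."""
    n = len(next(iter(samples.values())))
    return max(sum(series[t] for series in samples.values()) for t in range(n))

# Memory samples (GB) taken at three synchronized instants on two nodes.
usage = {"node0": [1.0, 3.0, 2.0], "node1": [2.0, 1.5, 2.5]}
print(node_peaks(usage))
print(total_peak(usage))
```

Note that the two strategies genuinely differ: here the per-node peaks sum to 5.5 GB, but the peak aggregate demand is only 4.5 GB, because the nodes do not peak at the same instant.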
     2. A model of the time-space relationship and a method for predicting it are proposed.
     Four kinds of relationship between time and space are analyzed, and the corresponding time-efficiency and space-efficiency diagrams are given. Using these diagrams, one can locate the balance point that both fully exploits the system's computing capacity and shortens computation time.
     A computational model that expresses time in terms of space is proposed. Using a relatively simple method, it can compute at low cost the processing time of the critical storage space at a given processor scale. This makes it possible to study the effect of the critical storage space on the overall performance of a parallel program and to predict the time-space relationship.
     3. A time-overhead model for parallel programs and a method for calculating processor scale are proposed.
     For the architectures of distributed-memory, shared-memory, and distributed shared-memory parallel machines, time-overhead models of MPI, OpenMP, and hybrid MPI+OpenMP programs are studied. In particular, the study of MPI+OpenMP overheads reveals the sources of overhead in the hybrid programming model and the relationships among them.
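The dissertation's overhead models are not given in this abstract. A common way to think about hybrid overheads, sketched here with entirely hypothetical cost constants, is to split wall time into ideal compute time, MPI communication that grows with the process count, and OpenMP fork/join cost paid per parallel region:

```python
import math

def hybrid_time(work, procs, threads, *,
                msgs=100, latency=1e-5, region_cost=5e-6, regions=1000):
    """Illustrative decomposition of an MPI+OpenMP program's wall time.
    All cost constants are made up; this is not the dissertation's model."""
    compute = work / (procs * threads)               # ideal parallel compute
    mpi = msgs * latency * math.log2(max(procs, 2))  # collective-style comm cost
    omp = regions * region_cost * threads            # fork/join grows with team
    return compute + mpi + omp

# Same 64-way parallelism, different splits between processes and threads.
for procs, threads in [(64, 1), (8, 8), (1, 64)]:
    print(procs, threads, round(hybrid_time(1000.0, procs, threads), 4))
```

Even this toy model shows the qualitative point the abstract makes: the two overhead sources trade off against each other as the process/thread split changes, so they must be modeled jointly.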
     The characteristics of OpenMP programs are analyzed, showing that the method for determining processor scale must be reconsidered when programming with OpenMP. Based on how a program's memory footprint expands under parallel execution, methods for calculating the processor scale of OpenMP and MPI+OpenMP programs are proposed. To support the study of time-space issues on distributed-memory, shared-memory, and distributed shared-memory architectures, the main differences among the three are analyzed.
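The abstract does not give the proposed calculation methods, but the idea of bounding OpenMP thread count by memory expansion, and of taking a hybrid program's processor scale as processes times threads, can be sketched as follows (the per-thread expansion term and all sizes are hypothetical):

```python
def omp_processor_scale(serial_mem, expand_per_thread, node_mem, cores):
    """Illustrative rule: use as many threads as the cores allow while
    the expanded footprint (serial footprint plus a hypothetical
    per-thread expansion) still fits in node memory."""
    t = cores
    while t > 1 and serial_mem + expand_per_thread * t > node_mem:
        t -= 1
    return t

def hybrid_processor_scale(nodes, threads_per_node):
    """Total processor scale of an MPI+OpenMP run: one MPI process per
    node times the OpenMP team size on each node."""
    return nodes * threads_per_node

# A 6 GB serial footprint expanding 0.5 GB/thread on an 8 GB, 16-core node.
print(omp_processor_scale(6.0, 0.5, 8.0, 16))
print(hybrid_processor_scale(32, 4))
```

The sketch captures why the OpenMP case differs from pure MPI: on a shared-memory node the threads compete for one memory, so memory expansion, not just core count, caps the usable processor scale.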
