高效并行计算系统中的计算模型与通信网络

英文题名：The Computation Models and Interconnection Networks in High Efficient and Parallel Computing Systems
作者：刘方爱
论文级别：博士
学科专业名称：计算机系统结构
中文关键词：互连网络 ; 路由算法 ; 并行计算模型 ; RP(k)网络 ; 网络嵌入 ; 波长指派
英文关键词：Interconnection Network ; Routing Algorithms ; Parallel Computation Models ; RP(k) Network ; Network Embedding ; Wavelength Assignment
学位年度：2001
导师：刘志勇
学科代码：081201
学位授予单位：中国科学院研究生院（计算技术研究所）
论文提交日期：2001-04-01

摘要

并行处理作为一个重要的研究领域，受到研究者越来越多的重视，而提高通信效率是一个重要的研究课题。本文针对程序中的通信效率问题，分析了产生通信拥挤的三个因素，即BSP程序引起的通信拥挤、网络拓扑结构引起的通信拥挤和网络嵌入算法不当引起的通信拥挤。在此基础上，选择这三个问题作为本文的主攻方向。经过三年的研究，在阅读大量文献的基础上，取得了理想的研究成果。针对并行程序的效率，本文提出了CSA-BSP模型，该模型能够引导程序设计者写出高效率的并行程序；针对互连网络，本文提出了RP(k)网络，该网络拓扑具有许多优良的性质，因而能够更好地提高通信效率；针对网络嵌入问题，本文设计了将环、Mesh和Hypercube通信模式嵌入RP(k)互连网络的高效算法，使得在这些网络上已开发的应用能够高效地移植到RP(K)网络上。
     经过三年深入的研究，达到了预期的目的，取得了理想的结果，本文的主要创新点如下：
     ① 提出了一类基于Petersen图的互连网络拓扑结构RP(k)。该网络的连接度为5，直径为[k／2]+2；该网络具有短的网络直径、简单的拓扑结构、方便的路由策略；该网络硬件复杂度低、实现简单；在环、Mesh和Hypercube上设计的算法能够高效地嵌入RP(k)。同时，针对不同的互连网络规模，本文对该网络进行了扩展，提出了RP(P，k_1，k_2)网络拓扑，并讨论了其性质。
     ② 设计了RP(k)互连网络上的路由算法，主要设计了点点路由、置换路由、广播路由和All-to-all路由算法。这些算法的通信次数分别为[k／2]+2、k+5、[k／2]+2、k+5。这样，RP(k)网络和其上的路由算法为并行体系结构提供了一个实用的工具集。同时，在RP(P，k_1，k_2)网络上，本文也设计了以上四个通信模式的路由算法，它们的通信次数分别为[k_2／2]+[k_1／2]+2、4+m{k_2，k_1}+(k_2-1)~*(k_1-1)、[k_2／2]+[k_1／2]+2和10~*k_1*k_2-4。
     ③ 提出了将Ring和Mesh嵌入RP(k)网络的算法，证明了RP(k)网络是一个Hamiltonian图。考虑到网络的容错情况，当RP(k)网络中每个片有一个节点出现故障时，去掉故障节点和相应的边，得到互连网络RP-1(k)，我们证明了该网络也是Hamiltonian图。因此，可以分别将10*k和9*k个节点的环嵌入到RP(k)和RP-1(k)网络中去。同时，我们也设计了将2-D mesh嵌入RP(k)网络的方法，取得了良好的嵌入性能。
     ④ 基于光RP(k)网络，本文设计了一个实现n维Hypercube通信模式的算法，该算法最多需要max{2，[(5／3)~*2~(n-5)]}条波长。同时，设计了一个将n维Hypercube嵌入环网络
Parallel processing is one of the most interesting research areas in computer technology and has been investigated widely. Among those investigated subjects, how to enhance the communication efficiency is regarded as one of the key topics in parallel processing. The communication efficiency can be improved by several means, for example, designing efficient algorithms, writing efficient programs utilizing characteristics of computer hardware, and designing high efficient interconnection networks. Here we choose parallel computation models and interconnection networks as the key topics of our researches. Firstly, by improving the BSP model, the CSA-BSP model is proposed and it can guide programmers to write high efficient parallel programs. Secondly the topologies of interconnection networks are investigated and the RP(k) network is invented. The RP(k) network has many good properties, like simple topology, lower connectivity degree and smaller network diameter. Based on this network, the routing algorithms are designed.
    During the research, much work has been done. On interconnection networks, starting from some basic network topologies, for example, Ring, Torus, Hypercube and their routing strategies, some complex networks, like Butterfly networks, the Caylay graph networks, Hierarchy networks and Star-graph networks, are analyzed step by step. At the same time, the parallel computation models, mainly on shared memory models, hierarchy models, and message-passing models, are investigated. These understandings have contributed a lot to my reseaches.
    The main achievements of this thesis can be summarized as follows:
    ①Based on the Petersen graph and the Ring network, the RP(k) network is proposed. Its
    connectivity degree is 5 and its diameter is 「k/2」 + 2. Compared with 2-D Torus, the RP(k)
    network has many good properties, like short diameter, simple topology, flexible routing strategies. It is a practical interconnection network. To enlarge the size of the RP(k) network, the RP(P,k_1,k_2) network is proposed and its properties are analyzed.
    ②Based on the RP(k) network, the routing algorithms have been designed, which include point-to-point routing, One-to-all routing, permutation routing, and All-to-all routing.
    The efficiencies of these algorithms are 「k/2」 + 2、「k/2」 + 2、.k+5 and k+5 respectively. Thus
    the RP(k) network and its routing algorithms consist of a practical tool set for parallel architectures. Based on the RP(P,k_1,k_2) network, the above four routing algorithms are also

引文

[1] Adve S. V. and Gharachorloo K., Shared Memory Consistency Models: A Tutorial, IEEE Computer, Vol.29, No. 12,1996:66-76.

    [2] Agbaria A, Yosi Ben-Asher and Ilan Newman, Communication-processor Tradeoffs in Limited Resources PRAM, SPAA'99 ,1999.
    [3] Aggarwal A., Ashok K. Chandra, Marc Snir, On Communication Latency of PRAM Computations, Proc. 1~(st) Symp. on Parallel Algorithms and Architectures, June, 1989:11-21.
    [4] Aggarwal A., Ashok K. Chandra, Marc Snir, Communication complexity of PRAMs, Theory of Computer Science, Vol.71,No.l, 1990: 3-28.
    [5] Aleliunas R. and Rpsenberg A.L., On Embedding Rectangular Grids in Square Grids, IEEE Trans. On Computer, Vol. 31,No. 9, 1982: 907-913.
    [6] Alexandrov A., Mihai Ionescu, Klaus E. Schauser, LogGP: incorporating long messages into the LogP model for parallel computation, J. of Parallel Distributed Computation, Vol 44,No 1, 1997:71-79.
    [7] Alpern B, Ferrante L., Modeling Parallel Computations as Memory Hierarchies, Proc. of Programming Models for Massively Parallel Computers, Sept. 1993.
    [8] Alpern Bown et al., The Uniform Memory Hierarchy Model of Computation, Algorithmica, Vol.12,No.2-3, 1994.
    [9] Alvin R.Leback et al, Cache Profiling and the SPEC Benchmarks: A Case Studying, IEEE Trans, on Computer,Vol.43, No. 10, 1994.
    [10] Aumman Y et al., Clock Construction in Fully Asynchronous Parallel Systems and PRAM Simulation, Proc. of 33nd IEEE Symp. on Foundations of Computer Science, Oct. 1992: 147-156.

    [11] Azevedo M.M., Latifi S, and Bagherzadeh N., Low Expansion Packings and Embeddings of Hypercubes into Star Graph, Proc. of 15th IEEE Int'l Phenix Conf. On Comp. &comm., 1996:115-122.

    [12] Barry Wilkinson, and Michael Allen, Parallel Programming, Prentice Hall, New York, 1999.
    [13] Ben H.H. Juurlink and Harry A.G Wijshoff, The E-BSP Model incorporating General Locality and Unbalanced Communication into the BSP model, Proc. of Euro-Par'96, Springer LNCS 1124,1996:339-347.
    [14] Behrooz Parhami, Introduction to Parallel Processing, Plenum Press, New York, 1999.
    [15] Behrooz Parhami, Extended-Fault Diameter of mesh Networks, Proc. of Int'l Conf. Parallel and Distributed Processing:Techniques and Applications(PDPTA-2000),Las Vegas, Neveda, USA,2000 June.
    [16] Behrooz Parhami, Chi-Hsiang Yeh, Why Network Diameter is still Important, Proc. Int't conf. Communications in Computing(CIC-200), Las Vegas, USA,2000 June.
    [17] Behrooz Parhami and Ding-Ming Kwai, Challenges in Interconnection Network Design In the Era of Multiprocessor and Massively Parallel Microchips, Proc. Int't conf. Communications in Computing(CIC-200), Las Vegas, USA,2000 June.
    [18] Ben Miled and Fortes JAB., A Heterogeneous Hierarchical Solution to Cost-efficient High Performance Computing, Parallel And Distributed Processing Symp., 1996:138-145.
    [19] Bilardi G et al., BSP vs LogP, Proc. of 8th Annual ACM Symp. on Parallel Algorithms and Architectures, July, 1996.
    [20] Carl Hamacher V. and Hong Jiang, Hierarchical Ring Network Configuration and Performance Modeling, IEEE Tran. On Computer, Vol. 50, No.l, January 2001.
    [21] Chatterjee S. et al, Recursive Array Layouts and Fast Parallel Matrix Multiplication, Proceedings of the Eleventh ACM Symposium on Parallel Algorithms and Architecture, June 1999.
    [22] Chenxi Zhang et al, Two Fast and High-Associativity Cache Schemes, IEEE Micro, September, 1997:40-49.
    [23] Chen G.H. and Duh D.R., Topological Properties, Communication, and Computation on WK_recursive networks, Networks, Vol. 24, No. 6,1994:303-317.
    [24] Chunming Qiao and Yousong Mei, Off-Line Permutation Embeding and Scheduling in Multiplexed Optical Networks with Regular Topologies, IEEE/ACM Tran. On Networking, Vol.7, No.2, 1999:241-250.
    [25] Chunming Qiao and Yousong Mei, A Comparative Study of Cost Effective Multiplexing Approaches for Online Permutation Embeding and Scheduling in Optical Networks, Journal of Paralel and Distributed Computing, Vol. 61,No.l, 2001:1-17.
    [26] Culler D. et al., LogP: Towards a realistic model of parallel computation, Proc. of the ACM SIGPLAN Symp. on Principles & Practices of Parallel Programming, 1993:1 -12.
    [27] Culler D. et al., Fast Parallel Sorting under LogP:From Theory to Practice, Portibility and Performance for Parallel Processing, John Wiley&Son, 1993:71-89.
    [28] David A.Patterson and John L. Hennessy, Computer Architecture:A Quantitative Approach, China Machine Press, Beijing, 1999:37-42.
    [29] David E. Culler, Jaswinder Pal Singh and Anoop Gupta, Parallel Computer Architecture, China Machine Press, Beijing, 1999:936-944.
    [30] Eytan Modiano, Richard Barry, Design and Analysis of an Asynchronous WDM Local Area Network Using Master/slave Scheduler, INFOCOM99, New York, 1999:900-907.
    [31] Fangai Liu, Zhiyong Liu, Xiangzhen, Qiao, The Cache Performance Analysis of Algorithms, the Third International Conference on Computer-Aided Industrial Design and Conceptual Design, Hong Kong, November, 2000:313-317.
    [32] Fangai Liu, Zhiyong Liu and Xiang zhen Qiao, Program Optimizing and Performance Analysis Based on SMP-Cluster, PDPTA01, Lag Vegas, USA, 2001:815-821.
    [33] Fangai Liu, Zhiyong liu, and Qiao Xiangzhen, A Practical Interconnection network RP(k) and Its Routing Algorithms, Science in China(F), Vol.44 No.6,2001:462-473
    [34] Fen Lin Wu et al., Routing in a Class of Cayley Graphs of Semidirect Products of Finite Groups, Journal of Parallel and Distributed Computing Vol.60, 2000:539-565.
    [35] Fortune S. and Wyllie J., Parallelism in Random Access Machines. Proc. of 10 th Annual Symposium on Theory of Computing. 1978:114 - 118.
    [36] George N. Rouskus and Mostafa H. Ammar, Muli-Destination Communication Over Tunable-Receiver Single-Hop WDM Network, IEEE J-SAC, Vol.15, No.3, April 1997: 501-521.
    [37] Gerbessiotis A V, and Valiant L G., Direct Bulk-Synchronous Parallel Algorithms, LNCS, Springer Verlag, Vol. 621, 1992:1-18.
    [38] Gibbons P B., A More Practical PRAM Model, Proc. of 1st ACM Symp. on Parallel Algorithms & Architectures, June, 1989: 158-168.
    [39] Gibbons P B et al., Queue-read queue-write PRAM model: Accounting for Contention in Parallel Algorithms, Proc. of 5th ACM-SIAM Symp. on Discrete Algorithms, Jan. 1994: 638-648.
    [40] Gibbons P B., What Good are Shared-Memory Models? Proc. of the 1996 Workshop on Challenges for Parallel Processing, Aug., 1996:103-114.
    [41] Glen Kramer and Gerry Pesavento, Ethernet Passive Optical Network: Building a Next-Generation Optical Access Network, IEEE Communication Magazine, Vol.40, No.2, 2002:74-80.
    [42] Greg Gravenstreter and Rami G. Melhem, Realizing Common Communication Patterns in Partitioned Optical Passive Stars Networks, IEEE Tran. on Computer, Vol. 47, No.9,1998:998-1102.
    [43] Hambrusch S E and Khokhar A A., C~3: A Parallel Model for Coarse-grained Machines, J. of Parallel and Distributed Computing, Vol.32, No.1, 1996: 139-154.
    [44] Hambrusch S E., Models for Parallel Computation, Proc. of the 1996 Workshop on Challenges for Parallel Processing, Aug, 1996:92 - 95.
    [45] Heywood T et al., A Practical Hierarchical Model of Parallel Computation I. J. of Parallel & Distributed Computing, Vol.16, No. 2,1992: 212 - 232.
    [46] Heywood T et al., A Practical Hierarchical Model of Parallel Computation II, J. of Parallel & Distributed Computing, Vol.16, No.3, 1992: 233 - 249.
    [47] Hill M. D. and Smith A.J., Evaluating Associativity in CPU Cache, IEEE Trans. Computers, Vol. 38,No.12, 1989:1612-1630.
    [48] Howard Jay Siegel, Interconnection Networks for Large-scale Parallel Processing, McGraw-Hill Publishing Company, New York, 1990.
    [49] Huan-Chao Keh, Jen-Chih Lin, On Fault-tolerant Embedding of Hamilton Cycle, Linear Array and Rings in a Flexible Hypercube, Parallel Computing, Vol.26, No.6, 2000: 769-781.

    [50] Hwang K., Advanced Computer Architecture, MeGraw-Hill, New York, 1993.
    [51] Jan Trdlicka, Pavel Tvrdik, Embedding Complete k-ary Trees into k-Square 2-D Meshes with Optimal Edge Congestion, Parallel Computing, Vol.26, No.6, 2000:783-790.
    [52] Jean Walrand, Pravin Varaiya, High-Performance Communication Networks, China Machine Press, Beijing, 2000.
    [53] JaJa J F., An Introduction to Parallel Algorithms, Addison-Wesley Publishing Company, New York, 1992.
    [54] Jaja J F., Fundamentals of Parallel Algorithms, Parallel & Distributed Computing Handbook, McGraw-Hill, New York. 1996:333-354.
    [55] Kai Huang and Zhiwei Xu, Scalable Parallel Computing, China Machine Press, Beijing, 1999.
    [56] Krishnamurthy E V, Complexity Issues in Parallel and Distributed Computing, Parallel & Distributed Computing Handbook, McGraw-Hill, New York, 1996: 89 - 126.
    [57] Kendall Square Research Corporation, KSR1 principles of operation, Oct. 1991.
    [58] Kosaraju S.R. and Atalah M.J., Optimal Simulation Between Mesh-Connected Arrays of Processors, J. ACM, Vol. 35, 1988:635-650.
    [59] Kronsjo L I, PRAM Models, Parallel & Distributed Computing Handbook, McGraw-Hill, New York, 1996.
    [60] Lakshmivarahan S., Sudarshan K.Dhall, Ring, Torus and Hypercube Architectures /Algorithms for Parallel Computing, Parallel Computing Vol. 25, No. 13-14, 1999:1877-1906.
    [61] Lee S.K., Choi H.A., Embedding of Complete Binary Tree into Meshes with Row-column Routing, IEEE Transactions On Parallel and Distributed System, Vol.7,No.5, 1996: 493-497.
    [62] Leiserson C E., Fat-trees: Universal Networks for Hardware-efficient Supercomputing, IEEE Trans, on Computer, Vol.34, 1985: 892-901.
    [63] Lenoski D E and Weber W D., Scalable Shared-Memory Multiprocessing, Morgan Kaufmann Publishing House, 1995.
    [64] Lu Ruan, Dingzhu Du, Xiaodong Hu et al, Converter Placement Supporting Broadcast in WDM Optical Networks, IEEE Tran. on Computer,Vol.50,No.7, 2001:750-758.
    [65] Lusk E. L. and Group W., A Taxonomy of Programming Models for Symmetric Multiprocessors and SMP Clusters, Proceedings of Programming Models for Massively Parallel Computers, 1995:2-7.
    [66] Mansour Y et al., Trade-offs Between Communication Throughput and Parallel Time, Proc. 26th ACM Symp. On Theory of Computing, 1994:372-381.
    [67] Matteo Frigo, Cache-Oblivous Algorithm, PhD dissertation, MIT, 1999.
    [68] Ming Yang Su et al., Broadcasting on Incomplete WK-Recursive Networks, Journal of Parallel and Distributed Computing, Vol. 57, 1999:217-224.
    [69] MPI1.1/2.0 文档, Http://lssc.cc.ac.cn.
    [70] OpenMP Arcgitecture Review Board, OpenMP Fortran Application Program Interface, Http://www. openmp.org, 1997.
    [71] Phillip B.Gibbons et al., What Good are Shared-Memory Models, The International Conference on Parallel Processing Workshop, 1996.
    [72] Rodriguez C, et al, A New Parallel Model for the Analysis of Asynchronous Algorithms, Parallel Computing, Vol.26,No.6, 2000:753-767.
    [73] Ronald Fernandes et al, Generalized Ring Interconnection Networks, The proceedings of 1994 International Conference on Parallel Processing, 1994:30-34.
    [74] Sajal K. Das, et al., Hyper Petersen Network: Yet Another Hypercube-like Topology, The Fourth Symposium on the Frontiers of Massively Parallel Computation, 1992: 270-277.
    [75] Sanjay Ranka and Sartaj Sahni, Hypercube Algorithms, Springer-Verlag, New York, 1990.

    [76] Seitz, C. L., The Cosmic Cube, Communication of ACM, Vol. 28, No.l, 1985:22-33.
    [77] Ulm D and Baker J W., Simulation PRAM with a MSIMD Model, Proc. of the ICPP'98, Aug. 1998.
    [78] Valiant L G., A Bridging Model for Parallel Computation, Communication of the ACM, Vol.33,No.8, 1990:103-111.
    [79] Xiaojun Shen, et al, On Embedding Between 2D Meshes of the Same Size, IEEE Trans. Computer, Vol. 46, No.8, 1997.
    [80] Yeh,C.-H. And B.Parhami, Hierarchical Swapped Networks: Efficient Low-degree Alternatives to Hypercube and Generalized Hypercubes, Proc. of Int'l Symp. On Parallel Architectures, Algorithms, and Networks, 1996:90-96.
    [81] Yeh Chi-Hsiang and Behrooz Parhami, Routing and embeddings in cyclic Petersen network: an efficient extension of the Petersen graph, Proc. Int'l of Conf. Parallel Processing, Sep. 1999:258-265.
    [82] Yeh Chi-Hsiang and Varvarigos, Macro-star networks: efficient low-degree alternatives to star graphs, IEEE Tran. on Parallel &Distributed System, Vol 9, No. 10, 1998:987-1003.
    [83] Yuan X. and Melhem R., Optimal Routing and Channel Assignments for Hypercube Communication on Optical Mesh-like Processor Arrays, Fifth International Conference on Massively Parallel Processing Using Optical Interconnection, Las Vegas, June 1998.
    [84] Yuan X., Rami Melhem and Rajiv Gupta, Distributed Path Reservation Algorithm for Muliplexed All-Optical Interconnection Networks, IEEE Tran. On Computer, Vol.48, No.12, 1999:1355-1363.
    [85] Yuan X., Melhem R. and Gupa R., Performance of Multi-hop Communications Using Logical Topologies on Optical Torus Networks, Journal of Parallel and Distributed Computing, 2001, vol 61(6): 748-766.
    [86] Yuanyuan Yang, Jianchao Wang and Yi Pan, Permutation Capability of Optical Multistage Interconnection Networks, Journal of Parallel and Distributed Computing, Vol. 60, No. 1,2000:72-91.
    [87] Yuh Rong Leu and Sy-Yen Kuo, Distributed Fault-Toerant Ring Embeding and Reconfiguration in Hypercubes, IEEE Tran. on Computers, Vol.48, No.1,1999:81-88.
    [88] Yu Liang Liu, Yue Li Wang and D.J. Guan, An Optimal Faulttolerant Routing Algorithm for Double-Loop Networks, IEEE Tran. on Computer, Vol.50, No.5, 2001:500-505.
    [89] Zeydy Ortiz, George N.Rouskas, and Harry G.Perros, Maximizing Multicast Throughput in WDM Networks with Tuning Latencies Using the Virtual Receiver Concept, European Trans. On Telecommunications, Vol.11, Jan.-Feb., 2000:63—72.
    [90] Zhao Zhang et al, Cache-Optimal Methods for Bit-Reversals, Proceedings of Supercomputing'99, November, 1999.
    [91] Zhen Liu and Ting-Yi Sung, Routing and Transmitting Problems in de Bruijn Network, IEEE Tran. On Computer, Vol. 45, No.9, 1996: 1056-1062.
    [92] Zheng S. Q. et al, Dual of a Complete Graph as an Interconnection Network, Journal of Parallel and Distributed Computing, Vol. 60, 2000:1028-1046.
    [93] Zhiyong Liu and D.W. Cheung, Efficient Parallel Algorithm for Dense Matrix LU Decomposition with Pivoting on Hypercubes, International Journal of Computer and mathmatics with applications, Vol.33, No.8, 1997:39-50.
    [94] Zhiyong Liu and D. Zhang, Oblivious Routing for LC Permutation on Hypercubes, Parallel Computing,Vol.25, 1999: 445-460.
    [95] Z.Liu, E.Li, and X. Qiao, XOR Mapping Technique for Data Caches, China Patent #ZL97120245.1, Nov.6, 1997.
    [96] 乔香珍，刘方爱，并性计算模型综述，计算机科学，已接受，待发表。
    [97] 乔香珍，“Cache性能与程序优化”，计算机学报，Vol．19，No．11，Nov．1996。
    [98] 孙家昶等，网络分布计算与分布式编程环境，科学出版社，北京，1996。
    [99] 李明，唐志敏，一种新的Cache优化方法——部分Cache局部性方法，计算机学报，Vol．20，No．1，1997。
    [100] 李晓梅，莫则尧，胡庆丰等，可扩展并行算法的设计与分析，国防工业出版社，北京，2000．7。
    [101] 许进，自补图理论及其应用，西安电子科技大学出版社，1999。

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700