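Research on Key Mechanisms of Parallel Computing in the Triplet-Based Architecture

Author: Li Jiaxin
Degree level: Doctoral
Discipline: Computer Software and Theory
Chinese keywords: 基三体系结构 (Triplet-Based Architecture) ; 并行计算 (parallel computing) ; 体系结构描述语言 (architecture description language) ; 数据交换结构 (data switch structure)
English keywords: TriBA ; Parallel processing ; Architecture Description Language ; Data Switch Structure
Degree year: 2010
Advisor: Shi Feng
Discipline code: 081202
Degree-granting institution: Beijing Institute of Technology
Thesis submission date: 2010-06-26
Chair of the defense committee: Lin Shouxun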

Abstract

Chip multiprocessors (CMPs) have become one of the main ways to improve computer performance, and parallel computing on multi-core processors is a current focus of high-performance computing research. Difficulties remain, including how to make full use of multi-core processing resources and how to help programmers write parallel programs for multi-core architectures. Triplet-Based Architecture (TriBA) is an object-oriented multi-core processor architecture, and parallel computing on TriBA faces the same problems. This dissertation studies several key technologies for parallel computing in TriBA: aid tools for parallel program design on multi-core processors, and parallel data stream scheduling and data transfer in multi-core architectures based on a Network-on-Chip (NoC). The research content and main results are:
     1. The design of an aid tool for parallel program design on multi-core architectures, the Feedback Parallel Programming Framework (FPPF), is proposed and partially implemented. The main idea of FPPF is to lower the level at which programmers have to reason while programming, so that they can understand certain hardware features, choose a suitable design, and write parallel programs that fit a specific multi-core architecture, thereby improving program performance. FPPF is component-based. Architectures and their corresponding algorithms are stored in FPPF as patterns (templates). Programmers can combine, modify, or create architecture patterns to construct candidate solutions for a parallel program in advance, and FPPF evaluates and compares the candidates so that a better one can be chosen for further design. This reduces the burden of repeated modification, debugging, and verification. Existing tools can also be added to FPPF as components or modules.
     2. The traversal properties of the TriBA NoC topology are proved, covering the Hamilton path and the minimum spanning tree. The concept of a streaming model is defined, and several streaming models of TriBA are constructed. In FPPF, streaming models can be used to build the user interface component, which helps programmers understand the topological features of an architecture. They can also be used for the top-level design of parallel programs and for scheduling parallel data streams.
     3. A method for describing parallel architectures, the Hierarchy Parallel Computing Model (HPCM), is proposed. HPCM is a self-nesting, multi-level description method that can flexibly describe parallel architectures and the way they run at different granularities. Methods for evaluating the performance of parallel solutions based on HPCM at different levels of precision are also proposed. HPCM and its evaluation methods can be used to build the architecture pattern library component and the static evaluation engine component of FPPF.
     4. A design method, Graph State Select (GSS), is proposed for the Concurrent Multi-direction Data Switch Structure (CMDSS), a key component for parallel data transfer in the NoC of a multi-core processor. GSS uses the topological features of the NoC to extract the basic states of the multi-direction data switch structure. A control and scheduling algorithm, FG-NC, is proposed and implemented. FG-NC uses the states extracted by GSS to construct control codes for the data switch structure, which improves its parallelism under given hardware conditions. The InterUnit of TriBA is redesigned using GSS, and the new InterUnit efficiently supports parallel transfer of unicast, multicast, and broadcast data.
     5. A set of methods that use the topological features of the NoC of a multi-core processor to schedule data streams, Stream Schedule based on Topology Features (SSTF), is proposed. SSTF mainly comprises Divide and Select strategies. The Divide strategies apply when the inherent load of the architecture is low. When the inherent load is high, the Select strategy uses topology weights to assist the Divide strategies in scheduling data stream tasks. Taking the application of SSTF to Triplet-based Interconnection Networks (THIN) as an example, the topology weights of THIN are computed, and the Divide strategies are analysed both with and without transfer latency.
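
The traversal results in item 2 can be explored on a small instance. The Python sketch below is only an illustration under stated assumptions: it assumes the network is built recursively from triangles, with three sub-groups joined corner to corner at each level (the abstract does not spell out the link layout), and it assumes unit link weights, under which every spanning tree has the same total weight and is therefore minimal. The function names and the node labels (strings over the digits 0 to 2) are hypothetical.

```python
from collections import deque
from itertools import product


def build_triplet_network(levels):
    """Adjacency sets for a recursively built triangular network (assumed layout)."""
    nodes = ["".join(p) for p in product("012", repeat=levels)]
    adj = {n: set() for n in nodes}
    for n in nodes:
        # Links inside the innermost triangle: same prefix, different last digit.
        for d in "012":
            if d != n[-1]:
                adj[n].add(n[:-1] + d)
        # Corner-to-corner link between sibling groups:
        # prefix + i + jj..j  <->  prefix + j + ii..i  (i != j).
        for k in range(levels - 1):
            i, tail = n[k], n[k + 1:]
            if len(set(tail)) == 1 and tail[0] != i:
                adj[n].add(n[:k] + tail[0] + i * len(tail))
    return adj


def hamilton_path(adj, start):
    """Backtracking search for a path that visits every node exactly once."""
    path, seen = [start], {start}

    def extend():
        if len(path) == len(adj):
            return True
        for nxt in sorted(adj[path[-1]]):
            if nxt not in seen:
                seen.add(nxt)
                path.append(nxt)
                if extend():
                    return True
                seen.discard(nxt)
                path.pop()
        return False

    return path if extend() else None


def bfs_spanning_tree(adj, root):
    """Spanning tree as (parent, child) edges, found by breadth-first search."""
    parent = {root: None}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in sorted(adj[u]):
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return [(p, c) for c, p in parent.items() if p is not None]


network = build_triplet_network(2)          # 9 nodes under the assumed layout
print(hamilton_path(network, "00"))
print(bfs_spanning_tree(network, "00"))
```

Backtracking is adequate for the small networks used here. It is exponential in the worst case, so it should be read as a way to check small instances and not as a substitute for the constructive proofs the dissertation describes.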
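
Item 5 separates Divide strategies, used when the inherent load is low, from a Select strategy that brings in topology weights when the load is higher. A minimal sketch of that split is given below. Everything concrete in it is an assumption made for illustration and is not the dissertation's definition: the even split, the load threshold, the rule "lowest load first, higher weight breaks ties", and all function and parameter names are hypothetical.

```python
def divide_evenly(total_items, nodes):
    """Divide step (assumed form): split a stream into near-equal shares."""
    base, extra = divmod(total_items, len(nodes))
    # The first `extra` nodes receive one additional item each.
    return {node: base + (1 if idx < extra else 0)
            for idx, node in enumerate(nodes)}


def select_nodes(load, weight, count):
    """Select step (assumed form): lowest load first, higher weight breaks ties."""
    ranked = sorted(load, key=lambda n: (load[n], -weight[n], n))
    return ranked[:count]


def schedule(total_items, load, weight, busy_threshold, count):
    """Use every node when load is low; otherwise select a subset, then divide."""
    if max(load.values()) <= busy_threshold:
        targets = sorted(load)
    else:
        targets = select_nodes(load, weight, count)
    return divide_evenly(total_items, targets)


# Hypothetical per-node load and topology weight values.
load = {"n0": 0, "n1": 5, "n2": 0, "n3": 2}
weight = {"n0": 1, "n1": 3, "n2": 2, "n3": 1}
print(schedule(10, load, weight, busy_threshold=1, count=2))
```

The sketch ignores transfer latency. The abstract notes that the Divide strategies are analysed both with and without it, so a fuller model would add a per-link delay term to each share.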
