Research on Memory System Issues in the Triplet-Based Architecture

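Author: Liu Mengxiao (刘梦晓)
Thesis level: Doctoral
Discipline: Computer Software and Theory
Keywords (translated from the Chinese): chip multiprocessor; object-oriented; hierarchical memory architecture; memory mapping; message mechanism; object management; memory access policy
Keywords (English): multi-core processor; CMP; object-oriented; memory architecture; cache mapping; object management; memory access scheduling
Degree year: 2010
Advisor: Shi Feng (石峰)
Discipline code: 081202
Degree-granting institution: Beijing Institute of Technology (北京理工大学)
Thesis submission date: 2010-06-26
Chair of the defense committee: Lin Shouxun (林守勋)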

Abstract

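As is well known, multi-core processor architecture is the mainstream of next-generation microprocessor architecture, and one concrete realization of it is the chip multiprocessor (CMP), also called a multi-core microprocessor. The primary issue in designing a CMP is the choice of program execution model: only a suitable execution model can exploit program parallelism to the greatest extent and make the most of the potential of the multiple processing nodes in a CMP. Object orientation is a methodology based on the laws of human cognition, and the principle of abstraction and layering that it advocates is key to solving complex problems. After several decades of development, the object-oriented model has shown promise as an architecture for concurrent and distributed systems. However, the widening gap between computer architecture and programming languages makes programs run less and less efficiently, whereas when the program structure closely matches the multi-core architecture of the target machine, the machine's capability can be fully exploited. Therefore, if the object-oriented model is combined with multi-core processor architecture according to the way complex problems are decomposed, and a multi-core processor that supports object execution is designed, this can not only address the problem of parallel programming on multi-core processors but also improve the execution efficiency of object-oriented programs. In addition, because traditional shared-bus communication structures suffer from latency, communication bottlenecks, and low design efficiency, the network on chip (NoC) is considered a more suitable way to build multi-core systems.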
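     This dissertation is based on TriBA (Triplet-based Architecture), a triplet-based multi-core architecture that supports the object-oriented paradigm and is built on an on-chip triplet-based interconnection network topology. It focuses on issues related to the memory architecture in TriBA and on the key techniques for supporting object orientation. The main research content and results are as follows: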
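     (1) A hierarchical distributed shared memory architecture suited to TriBA, HSMA (Hierarchical Shared Memory Architecture), is proposed, together with a partially inclusive memory mapping strategy suited to this memory structure. Supported by parallel multi-port shared memory, HSMA combines distributed memory with shared memory; it effectively reflects and supports the computation and communication locality of TriBA and provides TriBA with efficient memory access. The concrete design and implementation strategy of HSMA are given, which offer useful guidance for designing memory architectures of distributed systems. In addition, drawing on TriBA's support for object orientation, a new partially inclusive cache mapping strategy is designed. The study shows that this method emphasizes the system's hierarchical, object-oriented, and cooperative-work characteristics, and that it adapts well and flexibly to HSMA.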
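     (2) The methods and implementation mechanisms by which the TriBA system supports object orientation are studied. The software execution model of object-oriented programs in TriBA is introduced, and concrete schemes are proposed from the perspectives of implementing object-oriented features and supporting object-oriented syntax. Because the object is the basic processing unit in TriBA, a globally unique object identifier and a classified multilevel object table structure are designed for storing, accessing, and managing objects in the system. In addition, a method is designed for representing the continually created objects in the system with a limited number of object identifier bits, namely a reusable object table and an object number reuse algorithm. Finally, the object addressing process is given, including the partitioning and access of the object address space and object address translation.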
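     (3) A message mechanism for system communication in the TriBA on-chip multi-core system is designed. The concrete formats of message requests and message objects are given, along with how the message mechanism operates. After comparing the topological characteristics of the triplet-based network and the 2D-Mesh, the advantages of the triplet-based network for building inter-core interconnection networks on multi-core processors are identified. To address the transmission bottleneck of the TriBA triplet-based network, two measures for improving system communication efficiency are proposed that make full use of the network's locality, and the auxiliary role of TriBA's hierarchical distributed shared memory structure in system communication is examined in particular. Based on the communication modes and message contents, six types of communication message are designed for the system, and the operating process and applicable domain of each are studied in detail. Finally, the efficiency of this message mechanism and the conditions under which it applies are demonstrated through theoretical analysis and simulation results.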
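     (4) A fair memory access scheduling mechanism that dynamically time-division multiplexes shared memory bandwidth, DBTDMA (Dynamic Bandwidth Time Division Multiplex Access), is proposed to address the problem that competition among processing cores for shared resources in multi-core computer systems aggravates the memory wall. As processor performance keeps improving, the speed of the memory system has gradually become the performance bottleneck of the whole system. As the number of cores per processor and the number of threads per application grow, the performance of more and more applications will be constrained by the bandwidth of the processor's memory system, which is a shared resource. In the TriBA system, this problem appears as contention among the cores of a group for the data channel between intra-group shared memory and inter-group shared memory. A new scheduling mechanism for responding to memory access requests is designed for the shared memory, and a variable-priority arbitration and adjustment strategy is proposed, achieving fair response to and dynamic planning of memory access requests from multiple cores. Experimental results show that the DBTDMA mechanism prevents memory access requests from waiting unpredictably long or starving, and shortens the average latency of memory access responses.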
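
Item (2) above mentions representing an ever-growing stream of objects with a limited number of identifier bits by reusing object numbers. The abstract does not describe the actual algorithm, so the following is only a minimal sketch of the general idea, assuming a single flat table and a simple free list; the class name `ReusableObjectTable` and all of its methods are hypothetical and are not taken from the dissertation.

```python
from collections import deque


class ReusableObjectTable:
    """Toy object table: fixed-width object numbers that are recycled.

    Hypothetical illustration only; not the table structure of TriBA.
    """

    def __init__(self, id_bits):
        self.capacity = 1 << id_bits   # how many distinct numbers fit in id_bits
        self.next_fresh = 0            # lowest number that was never handed out
        self.free_ids = deque()        # numbers released by destroyed objects
        self.entries = {}              # object number -> object address

    def allocate(self, address):
        """Bind a new object to a number, preferring recycled numbers."""
        if self.free_ids:
            obj_id = self.free_ids.popleft()
        elif self.next_fresh < self.capacity:
            obj_id = self.next_fresh
            self.next_fresh += 1
        else:
            raise MemoryError("object table full")
        self.entries[obj_id] = address
        return obj_id

    def release(self, obj_id):
        """Destroy an object and make its number available again."""
        del self.entries[obj_id]
        self.free_ids.append(obj_id)

    def lookup(self, obj_id):
        """Translate an object number into the address stored in the table."""
        return self.entries[obj_id]


table = ReusableObjectTable(id_bits=2)                      # only 4 numbers exist
ids = [table.allocate(0x1000 + 16 * i) for i in range(4)]   # table is now full
table.release(ids[1])                                       # one object is destroyed
recycled = table.allocate(0x2000)                           # a fifth object reuses its number
```

With two identifier bits the table can name only four live objects at once, yet a fifth object can still be created after one is released. A real design would also have to guard against stale references to a recycled number, which this sketch does not attempt.
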
It is well known that multi-core processors are the mainstream of next-generation microprocessor architecture, and that the chip multiprocessor (CMP) will be the dominant design paradigm in this area. To exploit the potential performance of multiple cores and expose more parallelism in programs, the choice of programming model becomes the most important part of CMP design. The object-oriented paradigm (OOP) is based on the way humans understand the world, and its principles of information hiding and data abstraction are key to solving complex problems and building parallel programs. After several decades of development, it is widely accepted that OOP improves code reusability and eases code maintenance, and software engineers and developers have embraced object-oriented programming for these benefits. However, object-oriented programs running on non-object-oriented processors generally perform worse than procedure-oriented programs. Because objects are independent of each other and naturally parallel, a multi-core processor that supports object-oriented computing not only eases the burden of parallel program design but can also accelerate the execution of object-oriented programs. On the other hand, classical on-chip communication architectures use a traditional Time-Division Multiplexed (TDM) bus, and bus-based architectures suffer from the clear bottleneck of the shared medium used for transmission. Network on Chip (NoC), a new chip design paradigm, is expected to be an important architectural choice for CMP, since replacing global wiring with a network offers advantages in structure, performance, and modularity. An architecture that is based on NoC and supports object-oriented technology is therefore expected to become a major trend in the design of future generations of microarchitectures. The research in this dissertation is based on a novel object-oriented multi-core architecture named TriBA (Triplet-based Architecture).
This dissertation focuses on key aspects of the memory architecture and the object-oriented support scheme. The main research work and contributions are as follows:
     (1) A novel Hierarchical Shared Memory Architecture (HSMA) suited to TriBA is proposed, together with its partially inclusive memory mapping scheme. With the support of multi-port shared memory, HSMA combines distributed memory and shared memory to build a highly efficient memory system for TriBA. This work presents the design and implementation of HSMA, and analysis of the structure shows that HSMA fully exploits the computation and communication locality of TriBA. HSMA is shown to have good organization, adaptability, and flexibility. In addition, the partially inclusive memory mapping scheme used in TriBA is well suited to object-oriented memory management.
     (2) The support methods and implementation scheme for the object-oriented paradigm in TriBA are studied. As an object-oriented processor, TriBA can realize object properties and object operations at both the software and hardware levels. This work proposes an object support scheme covering object mapping, object denotation, object realization, and object management. An object identifier serves as the reference to an object, and indirect object addressing is achieved through a multilevel object table. The object address space and the address mapping process are also explained.
     (3) A communication scheme for TriBA based on message passing is proposed. Message passing is the only means of communication among cores and objects in TriBA. First, the detailed format of the message, the definition of the message class, and the message scheme are presented. Second, after comparing the topological properties of the triplet-based network with those of the 2D-Mesh interconnection, a novel strategy for high-performance communication in TriBA is introduced, which uses the data channels along with the on-chip inter-core channels to transfer messages and data. Third, messages are classified into six kinds according to the transaction mode. Simulation results show that this strategy can enhance the efficiency of communication in TriBA.
     (4) A novel, fair Dynamic Bandwidth Time Division Multiplex Access (DBTDMA) scheme for memory access scheduling is proposed. Because of the gap between the speed of the processor and that of the storage system, the performance of the whole processor is limited by the efficiency of the memory system. In object-oriented on-chip multi-core systems, memory bandwidth is the key resource shared among cores, and memory access becomes the performance bottleneck. To improve the efficiency of the memory system, this work puts forward a scheme that lets multiple cores share the memory bandwidth dynamically. Aided by variable per-core access priorities, DBTDMA provides fair memory access among cores and shortens memory access latency.
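
The abstract describes DBTDMA only at a high level: cores share memory bandwidth dynamically, and per-core priorities are adjusted to keep access fair. As a rough illustration of that general idea, and not of the DBTDMA design itself, the sketch below assumes one grant per time slot and a simple aging rule in which a core that waits gains priority and a core that is served drops back to the lowest priority. The class name `AgingArbiter` and its method are hypothetical.

```python
class AgingArbiter:
    """Grants one memory slot per cycle; waiting cores gain priority over time.

    Hypothetical aging rule for illustration; not the DBTDMA policy itself.
    """

    def __init__(self, num_cores):
        self.priority = [0] * num_cores

    def grant(self, requesting):
        """Return the requesting core with the highest priority.

        Ties go to the lowest core index. Returns None if nobody requests.
        """
        if not requesting:
            return None
        winner = max(sorted(requesting), key=lambda core: self.priority[core])
        for core in requesting:
            # The winner falls to the lowest priority; cores still waiting age by one.
            self.priority[core] = 0 if core == winner else self.priority[core] + 1
        return winner


arbiter = AgingArbiter(num_cores=3)
# All three cores request memory in each of six consecutive slots.
grants = [arbiter.grant({0, 1, 2}) for _ in range(6)]
```

Because slots are not statically bound to cores, a slot that one core does not need can go to another requester, and because waiting raises priority, a core that keeps requesting is eventually served in this sketch.
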

