基于图着色的存储层次优化技术研究

英文题名：Research on Graph Coloring Based Optimization Technologies for Memory Hierarchies
作者：邓宇
论文级别：博士
学科专业名称：计算机科学与技术
中文关键词：存储层次优化 ; Cache优化 ; 流寄存器文件分配 ; 编译时存储分配 ; 图着色
英文关键词：memory hierarchies optimization ; cache optimization ; stream register file allocation ; compile-time memory allocation ; graph coloring
学位年度：2007
导师：杨学军
学科代码：081201
学位授予单位：国防科学技术大学
论文提交日期：2007-10-01

摘要

处理器与存储器的性能差距导致了“存储墙”问题的出现,使得存储系统成为计算机系统的瓶颈。从目前工艺水平和体系结构技术的发展趋势来看,这种差距还会继续增加,因此在未来可预测的范围内,对存储系统的优化将一直是提高计算机系统性能的关键技术之一。
     本文着重研究了如何将图着色理论应用于各级存储层次的优化问题,在cache、以流寄存器文件为代表的片上大容量寄存器文件和主存三方面提出了创新的编译时优化方法。本文取得的主要研究成果如下:
     (1)提出了一个基于图着色的cache优化算法—Cache Coloring。该算法根据访存行为将程序中的数据划分成若干数据对象,然后根据数据对象的大小将cache划分为一个带有别名的伪寄存器文件,每个伪寄存器由若干cache行组成,可以容纳一个数据对象;最后使用一个经过改进的图着色寄存器分配算法来决定这些对象在cache中的位置以及发生冲突时的替换关系。数据对象的划分将cache的管理分为两个层次,一个是编译时编译器对粗粒度的数据对象的管理,另一个是运行时硬件对细粒度的cache行的管理,这样编译器和硬件的优势都得到发挥。我们构造了比传统的生命周期相干图蕴涵更多相干信息的冲突矩阵,作为处理寄存器分配冲突时的指导原则。我们基于GCC进行了实现,并通过simplescalar构造了支持CacheColoring的硬件模拟平台。实验结果表明Cache Coloring能较好的开发程序的局部性,降低cache失效率。
     (2)提出了一个基于图着色的流寄存器文件分配算法—SRF Coloring。该算法通过寄存器划分将流寄存器文件转换为大小和位置固定的传统寄存器文件,从而将流寄存器文件的分配问题归结为一个可以采用图着色寄存器分配算法解决的问题。针对流寄存器文件的硬件机制和程序访问流寄存器文件的特点,我们对已有的图着色算法进行了扩充,使之能够更加有效地进行流寄存器文件的分配。为了解决因流太长而导致相干图不可着色的问题,我们提出了重用优先的双缓冲策略。我们在SF95Compiler编译框架中实现了SRF Coloring。SF95Compiler是一个为FT64及其编程语言SF95开发的编译器。实验结果表明,SRF Coloring能够有效地管理流寄存器文件。
     (3)提出了一个基于区间着色的主存分配算法—MM Coloring。我们以数组作为分配主存空间的候选,将主存分配问题归结为区间着色问题。由于一般的区间着色问题是NP完全问题,我们利用程序所具有的一个性质,即数组生命周期的包含性来降低区间着色的难度,将一般的区间着色问题简化为超完美图的区间着色问题,并据此提出了实现最优区间着色的判定条件以及实现算法。当不满足最优区间着色的条件时,可以通过分割数组的生命周期来使之得到满足。我们提出了两种分割数组生命周期的策略。一种是自底向上的积极分割策略,它首先将数组的生命周期分割到最小,然后在满足着色条件的前提下逐步合并生命周期;另一种是自顶向下的被动分割策略,它在一开始将数组的生命周期保持为最长,只有在不满足着色条件时才选择某些生命周期进行分割。我们基于GCC进行了实现。模拟实验结果表明,该算法是一种有效的管理主存的编译方法。
The performance gap between processor and memory causes the problem known as "Memory Wall", which makes the memory system become the bottleneck of the computer system. The developing trends of current semiconductor and computer architecture techniques show that this gap will keep increasing. Consequently, the optimization to the memory system will be one of the key technologies to improve the overall performance of the computer system in the visible future.
     This thesis focuses on applying the graph coloring theory to the management optimizing problem of the memory hierarchies. Novel management approaches in compile-time have been presented in the aspects of cache, large on-chip register files represented by Stream Register File (SRF) and the main memory. Primary innovative works in this paper could be summarized as follows:
     (i) A graph coloring based management optimizing algorithm for cache, namely Cache Coloring, has been proposed. This algorithm first partitions the data into several data objects according to their memory accessing behaviors. Then it partitions the cache into a pseudo register file with alias according to the size of the data objects. Each pseudo register in this register file can hold one of the data objects. Finally, it uses an extended graph coloring register allocation algorithm to determine the position of each data object in the cache and their replacement relationship. The data object partitioning divides the management of cache into two levels, one for the coarse-granularity management of the data objects in the compile-time and the other for the fine-granularity management of the cache lines in the run-time. So the advantages of both compiler and hardware are exploited. The conflict matrix which contains more information about interference than the traditional live-range interference graph is constructed as a guide to deal with the conflicts of register allocation. Cache Coloring is implemented in GCC. A hardware simulation platform which supports Cache Coloring is built based on simplescalar. The primary experimental results show that Cache Coloring can exploit the locality well and reduce the cache miss rate.
     (ii) A graph coloring based SRF allocation algorithm, namely SRF Coloring, is proposed. This algorithm formulates the problem of SRF allocation into a register allocation problem which can be solved by a well understood graph coloring algorithm by partitioning the SRF into a traditional register file with fixed position and length. The existing graph coloring algorithm is extended and modified to address the unique aspect of the SRF allocation problems. A reusing-first double-buffering strategy is presented to address the uncoloralbe problem caused by very long streams. The SRF Coloring algorithm is implemented in the compiler framework of SF95Compiler, which is developed for FT64 stream processor and its programming language SF95. The experimental results demonstrate that SRF Coloring represents a promising compiler approach for the SRF management.
     (iii)An interval coloring based allocation algorithm for main memory, namely MM Coloring, is proposed. The main memory allocation problem is mapped to an interval coloring problem by taking arrays as candidates for memory allocation. As the general interval coloring problem is NP-complete, a property of the program, namely the containment of arrays' live-ranges, is used to decrease the hardness of interval coloring and simplify it as an interval coloring problem for superperfect graph. Based on this, a judgment theorem for optimal interval coloring together with the implementation algorithm is proposed. Live-range splitting can be used to make the graph satisfy the optimal interval coloring condition. Two strategies for live-range splitting are proposed. One is a top-down aggressive strategy which first splits the live-ranges into minimum and then merges them gradually when the coloring condition is satisfied. The other one is a bottom-up passive strategy which keeps the live-ranges as long as possible and splits some of them only when the coloring condition is not satisfied. MM Coloring with both aggressive and passive strategies is implemented in GCC. Modeling tests show that it is an efficient approach for compile-time memory allocation.

引文

[1] D.A. Patterson, J.L. Hennessy, Computer Architecture: A Quantitative Approach(Second Edition), China Machine Press, 1999.

    [2] W.A. Wulf, S.A. McKee, Hitting the Memory Wall: Implications of the Obvious,Computer Architecture News, 1995, 23(1):20-24.
    [3] P.J. Mucci, S. Moore, Performance Analysis of HPC Architectures. In HPC User Forum, Innovative Computing Laboratory, University of Tennessee: Princeton,NJ, 2003,

    [4] W.A. Hunt, The Memory Hierarchy, Report, 2003.
    [5] D. Burger, J.R. Goodman, A. Kagi, Memory Bandwidth Limitations of Future Microprocessors. In ISCA '96: Proceedings of the 23rd annual international Symposium on Computer architecture, ACM Press, 1996, 78-89.
    [6] S.E.Perl, RX.Sites, Studies of Windows NT Performance Using Dynamic Execution Traces. In Proceedings of 2nd USENIX Symposium on Operating Systems Design and Implementation, 1996,169-183.
    [7] S.Przybylski, Cache and Memory Hierarchy Design : A Performance Directed Approach, Morgan Kaufmann, 1990.

    [8] R.v. Pas, Memory Hierarchy in Cache-based Systems, Report, 2002.
    [9] W.-F.Lin, S.K.Reinhardt, D.Burger, Reducing DRAM Latencies with an Integrated Memory Hierarchy Design. In HPCA'01: Proceedings of the 7th International Symposium on High Performance Computer Architecture, 2001,301-312.
    [10] Intel, Intel(R) Itanium(R) 2 Processor Hardware Developer's Manual, http://www.intel.com/design/itanium2/manuals/25110901.pdf, 2002.

    [11] G.W. McFarland, CMOS Technology Scaling and Its Impact on Cache Delay,Stanford University, 1997.

    [12] O.S. Unsal, R. Ashok, I. Koren, C.M. Krishna, C.A. Moritz, Cool-cache for hot multimedia. In MICRO 34: Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture, IEEE Computer Society, 2001,274-283.
    [13] C.-L. Yang, H.-W. Tseng, C.-C. Ho, J.-L. Wu, Software-controlled cache architecture for energy efficiency, IEEE Transactions on Circuits and Systems for Video Technology, 2005,15(5):634-644.
    [14] D. Burger, J.R. Goodman, Billion-Transistor Architectures: There and Back Again, Computer, 2004, 37(3):22-28.
    [15] B. Khailany, W. Dally, S. Rixner, et al, Imagine: Media Processing with Streams, IEEE Micro, 2001,21(2):35-46.
    [16] U.J. Kapasi, W.J. Dally, S. Rixner, J.D. Owens, B. Khailany, The Imagine Stream Processor. In ICCD'02: Proceedings of 20th IEEE International Conference on Computer Design, 2002, 282-288.
    [17] W.J. Dally, P. Hanrahan, M. Erez, T.J. Knight, et'al, Merrimac: Supercomputing with Streams. In SC'03: Proceedings of Supercomputing Conference 2003, 2003,35-42.
    [18] D. Pham, S. Asano, M. Bolliger, et al, The Design and Implementation of a First-Generation CELL Processor. In ISSCC'05: Proceedings of IEEE International Solid-State Circuits Conference,2005,184-185.
    [19]J.d.Cuvillo,W.Zhu,H.u.Ziang,G.R.Gao,FAST:A Functionally Accurate Simulation Toolset for the Cyclops64 Cellular Architecture.In MOBS2005:Workshop on Modeling,Benchmarking,and Simulation,in conjuction with the 32nd Annual International Symposium on Computer Architecture(ISCA2005),ACM Press,2005,11-20.
    [20]ClearSpeed,CSX600 Processor Datasheet,http://www.clearspeed.com/,2005.
    [21]X.Yang,X.Yan,Z.Xing,Y.Deng,J.Jiang,Y.Zhang,A 64-bit Stream Processor Architecture for Scientific Applications.In ISCA'07:Proceedings of the 34th Annual International Symposium on Computer Architecture,2007,210-219.
    [22]R.Allen,K.Kennedy,Automatic Loop Interchange.In Prococeedings of the 1984 ACM SIGPLAN Compiler conference,1984,233-246.
    [23]D.F.Bacon,S.L.Graham,O.J.Sharp,Compiler Transformations for High-Performance Computing,ACM Comput.Surv.,1994,26(4):345-420.
    [24]R.Allen,K.Kennedy,Optimizing Compilers for Modern Architecture,A Dependence-Based Approach,Elsevier Science(USA),2001.
    [25]沈志宇,胡子昂,廖湘科,吴海平,赵克佳,卢宇彤,并行编译方法,北京:国防工业出版社,2000.
    [26]夏军,数据局部性及其编译优化技术研究,博士论文,国防科学技术大学计算机学院,2004.
    [27]M.C.Golumbic,Algorithmic Graph Theory and Perfect Graphs,Annals of Discrete Mathematics,2004.
    [28]D.B.West,李建中,骆吉洲译,图论导引(原书第2版),机械工业出版社,2006.
    [29]K.D.Cooper,T.J.Harvey,Compiler-controlled memory.In ASPLOS-VⅢ:Proceedings of the eighth international conference on Architectural support for programming languages and operating systems,ACM Press,1998,2-11.
    [30]N.Jayasena,Memory Hierarchy Design for Stream Computing,PhD thesis,Stanford University,2005.
    [31]P.Jain,S.Devadas,D.W.Engels,L.Rudolph,Software-Assisted Cache Replacement Mechanisms for Embedded Systems.In International Conference on Computer Aided Design,2001,119-126.
    [32]S.Carr,Memory-Hierarchy Management,PhD thesis,Rice University,1993.
    [33]A.S.Tanenbaum,A.S.Woodhull,王鹏,尤晋元,朱鹏,熬青云译,操作系统:设计与实现(第二版),电子工业出版社,1998.
    [34]S.V.Pemmaraju,R.Raman,K.Varadarajan,Buffer minimization using max-coloring.In SODA '04:Proceedings of the fifteenth annual ACM-SIAM symposium on Discrete algorithms,Society for Industrial and Applied Mathematics,2004,562-571.
    [35]R.Bordawekar,A.Choudhary,K.Kennedy,C.Koelbel,M.Paleczny,A Model and Compilation Strategy for Out-of-core Data Parallel Programs.In Proceedings of ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming,ACM Press,1995,1-10.
    [36]R.Bordawekar,A.Choudhary,J.Ramanujam,Compilation and Communication Strategies for Out-of-Core Programs on Distributed Memory Machines,Journal of Parallel and Distributed Computing,1996,38(2):277-288.
    [37]M.Kandemir,A.Choudhary,J.Ramanujam,R.Bordawekar,Compilation techniques for out-of-core parallel computations,Parallel Computing,1998,24(3-4):597-628.
    [38]P.Brezany,A.N.Choudhary,M.Dang,Language and Compiler Support for Out-of-Core Irregular Applications on Distributed-Memory Multiprocessors.In LCR'98:Selected Papers from the 4th International Workshop on Languages,Compilers,and Run-Time Systems for Scalable Computers,Springer-Verlag,1998,343-350.
    [39]D.A.Barrett,B.G.Zorn,Using lifetime predictors to improve memory allocation performance.In PLDI'93:Proceedings of the ACM SIGPLAN 1993conference on Programming language design and implementation,ACM Press,1993,187-196.
    [40]D.Grunwald,B.Zorn,CustoMalloc:Efficient Synthesized Memory Allocators,Software Practice&Experience,1993,23(8):851-869.
    [41]P.R.Wilson,M.S.Johnstone,M.Neely,D.Boles,Dynamic Storage Allocation:A Survey and Critical Review,Lecture Notes in Computer Science,1995:986.
    [42]GCC.ORG,http://gcc.gnu.org/gec-4.1/,2007.
    [43]Simplescalar,http://www.simplescalar.com/,2004.
    [44]N.P.Jouppi,Improving Direct-mapped CachePerformance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers.In ISCA'90:Proceedings of the 17th annual international Symposium on Computer architecture,1990,364-373.
    [45]C.May,E.Silha,R.Simpson,H.Warren,The PowerPC Architecture:A Specification for a New Family of RISC Processors,Morgan Kaufmann Publishers,Inc.,1994.
    [46]S.Microsystems,UltraSparc User's Manual,1997.
    [47]R.Kessler,The Alpha 21264 Microprocessor:Out-Of-Order Execution at 600Mhz.In Hot Chips 10,1998,
    [48]Intel,IA-64 Application Developer's Architecture Guide,1999.
    [49]C.-H.Chi,H.Dietz,Unified Management of Registers and Cache Using Liveness and Cache Bypass.In PLDI'89:Proceedings of the ACM SIGPLAN 1989 conference on Programming language design and implementation,1989,344-355.
    [50]杜红燕,田兴彦,田新华,一种新颖的软件可控Cache优化方法,计算机工程与应用,2005,21:52-57.
    [51]Z.Wang,K.S.McKinley,A.L.Rosenberg,C.C.Weems,Using the Compiler to Improve Cache Replacement Decisions.In PACT'02:Proceedings of the 2002International Conference on Parallel Architectures and Compilation Techniques,IEEE Computer Society,2002,199-210.
    [52]A.R.Lebeck,D.R.Raymond,C.-L.Yang,M.Thottethodi,Annotated Memory References:A Mechanism for Informed Cache Management.In European Conference on Parallel Processing,1999,1251-1254.
    [53]H.Yang,R.Govindarajan,G.Gao,H.u.Z.,Compilerassisted cache replacement:Problem formulation and performance evaluation.In LCPC'03:Proceedings of 16th International Workshop on Languages and Compilers for Parallel Computing,2003,
    [54]W.A.Wong,J.-L.Baer,Modified LRU Policies for Improving Second-Level Cache Behavior.In HPCA'99:Proceedings of the 6th International Symposium on High Performance Computer Architecture,1999,
    [55]K.Asanovic,Vector Microprocessors,PhD thesis,University of California at Berkeley,1998.
    [56]R.Espasa,F.Ardanaz,J.Emer,Tarantula:A Vector Extension to the Alpha Architecture.In ISCA'02:Proceedins of the 29th Annual International Symposium on Computer Architecture,2002,281-292.
    [57]N.Corporation,SX-6 Series CPU Functional Description Manual,2002.
    [58]陆洪毅,32位高性能嵌入式向量微处理器关键技术的研究与实现,博士论文,国防科技大学计算机学院,2002.
    [59]D.Patterson,T.Anderson,N.Cardwell,Intelligent RAM(IRAM):Chips that Remember and Compute.In Proceedins of the International Symposium on Solid-State Circuits,1997,224-225.
    [60]J.Draper,J.Chame,M.Hall,e.al,The Architecture of the DIVA Processing-In-Memory Chip.In ICS '02:Proceedings of the 16th annual international conference on Supercomputing,2002,14-25.
    [61]C.Kozyrakis,Scalable Vector Media-Processors for Embedded Systems,PhD thesis,University of California at Berkeley,2002.
    [62]温璞,面向科学计算的PIM体系结构技术研究,博士论文,国防科技大学计算机学院,2007.
    [63]R.Allen,K.Kennedy,Vector Register Allocation,IEEE Trans.Comput.,1992,41(10):1290-1317.
    [64]U.J.Kapasi,P.Mattson,W.J.Dally,J.D.Owens,B.Towles,Stream Scheduling.In Proceedings of the 3rd Workshop on Media and Streaming Processors,Stanford University,2001,101-106.
    [65]P.Mattson,A Programming System for the Imagine Media Processor,PhD thesis,Stanford University,2002.
    [66]P.Mattson,U.Kapasi,J.Owens,S.Rixner,Imagine Programming System User's Guide,Report,2002.
    [67]K.Fatahalian,D.R.Horn,Y.J.Knight,et al.,Sequoia:programming the memory hierarchy.In SC '06:Proceedings of the 2006 ACM/IEEE conference on Supercomputing,ACM Press,2006,83
    [68]P.R.Panda,N.D.Dutt,A.Nicolau,Efficient Utilization of Scratch-Pad Memory in Embedded Processor Applications.In EDTC '97:Proceedings of the 1997European conference on Design and Test,IEEE Computer Society,1997.
    [69]R.Banakar,S.Steinke,B.-S.Lee,M.Balakrishnan,P.Marwedel,Scratchpad memory:design alternative for cache on-chip memory in embedded systems.In CODES'02:Proceedings of the tenth international symposium on Hardware/software codesign,ACM Press,2002,73-78.
    [70]O.Avissar,R.Barua,D.Stewart,An optimal memory allocation scheme for scratch-pad-based embedded systems,Trans.on Embedded Computing Sys.,2002,1(1):6-26.
    [71]J.Sjodin,C.v.Platen,Storage allocation for embedded processors.In CASES '01:Proceedings of the 2001 international conference on Compilers,architecture,and synthesis for embedded systems,ACM Press,2001,15-23.
    [72]M.Kandemir,J.Ramanujam,J.Irwin,N.Vijaykrishnan,I.Kadayif,A.Parikh,Dynamic management of scratch-pad memory space.In DAC '01:Proceedings of the 38th conference on Design automation, ACM Press, 2001, 690-695.
    [73] S. Udayakumaran, R. Barua, Compiler-decided dynamic memory allocation for scratch-pad based embedded systems. In CASES '03: Proceedings of the 2003 international conference on Compilers, architecture and synthesis for embedded systems, ACM Press, 2003,276-286.
    [74] Y. Zhao, K. Kennedy, Dependence-based Code Generation for a CELL Processor. In LCPC'06: Proceedings of the 19th International Workshop on Languages and Compilers for Parallel Computing, 2006,
    [75] A.E. Eichenberger, K. O'Brien, K. O'Brien, et al, Optimizing Compiler for the CELL Processor. In PACT '05: Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques, IEEE Computer Society,2005, 161-172.
    [76] A.E. Eichenberger, K. O'Brien, K. O'Brien, et al, Using advanced compiler technology to exploit the performance of the Cell Broadband Enginee architecture, IBM Systems Journal, 2006, 45(1):59-84.
    [77] A.W. Appel, Modern Compiler Implementation in C, Cambridge University Press, 1998.
    [78] K. Kennedy, Global Flow Analysis and Register Allocation for Simple Code Structures, PhD thesis, Courant Institute, New York University, 1971.
    [79] P. Briggs, Register Allocation via Graph Coloring, PhD thesis, Rice University,1992.
    [80] G.J. Chaitin, M.A. Auslander, A.K. Chandra, J. Cocke, M.E. Hopkins, a.P.Markstein, Register allocation via coloring, Computer Languages, 1981,6(1):47-57.
    [81] G.J. Chaitin, Register allocation & spilling via graph coloring. In SIGPLAN '82:Proceedings of the 1982 SIGPLAN symposium on Compiler construction, ACM Press, 1982,98-105.
    [82] D. Bernstein, D.Q. Goldin, M.C. Golumbic, et al, Spill code minimization techniques for optimizing compilers, SIGPLAN Notices, 1989, 24(7):258-263.
    [83] P. Briggs, K.D. Cooper, a.L. Torczon, Rematerialization, SIGPLAN Notices,1992,27(7):311-321.
    [84] P. Briggs, K.D. Cooper, K. Kennedy, a.L. Torczon, Coloring heuristics for register allocation, SIGPLAN Notices, 1989, 24(7):275-284.
    [85] F. Chow, J. Hennessy, Register allocation by priority-based coloring. In SIGPLAN '84: Proceedings of the 1984 SIGPLAN symposium on Compiler construction, ACM Press, 1984, 222-232.
    [86] F.C. Chow, J.L. Hennessy, The priority-based coloring approach to register allocation, ACM Trans. Program. Lang. Syst., 1990,12(4):501-536.
    [87] P. Kolte, M.J. Harrold, Load/store range analysis for global register allocation,SIGPLAN Notices, 1993,28(6):268-277.
    [88] P. Bergner, P. Dahl, D. Engebretsen, M. O'Keefe, Spill code minimization via interference region spilling. In PLDI'97:Proceedings of the ACM SIGPLAN '97 Conference on Programming Language Design and Implementation, 1997.

    [89] H. Kim, Region-based Register Allocation for EPIC Architectures, PhD thesis,New York University, 2001.
    [90] T. Nakaike, T. Inagaki, H. Komatsu, T. Nakatani, Profile-Based Global Live-Range Splitting. In PLDI'06: Proceedings of the ACM SIGPLAN 2006 conference on Programming language design and implementation, 2006.
    [91] K.D. Cooper, L.T. Simpson, Live range splitting in a graph coloring register allocator. In Proceedings of the 1998 International Compiler Construction Conference, 1998.
    [92] P. Briggs, K.D. Cooper, L. Torczon, Improvements to Graph Coloring Register Allocation, ACM Trans. Program. Lang. Syst., 1994, 16(3):428-455.
    [93] L. George, A.W. Appel, Iterated Register Coalescing, ACM Trans. Program.Lang. Syst., 1996,18(3):300-324.
    [94] J. Park, S. Moon, Optimistic Register Coalescing, ACM Transactions on Programming Languages and Systems, 2004, 26(4):735-765.
    [95] B.R. Nickerson, Graph coloring register allocation for processors with multi-register operands. In PLDI '90: Proceedings of the ACM SIGPLAN 1990 conference on Programming language design and implementation, ACM Press,1990,40-52.
    [96] P. Briggs, K.D. Cooper, L. Torczon, Coloring register pairs, ACM Letters on Programming Languages and Systems, 1992, 1(1):3-13.
    [97] T. Zeitlhofer, B. Wess, List-Coloring of Interval Graphs with Application to Register Assignment for Heterogeneous Register-set Architectures, ACM Signal Processing, 2003, 83(7):1411-1425.
    [98] M.D. Smith, N. Ramsey, G. Holloway, A generalized algorithm for graph-coloring register allocation. In PLDI '04: Proceedings of the ACM SIGPLAN 2004 conference on Programming language design and implementation, ACM Press, 2004, 277-288.
    [99] L.i. Lian, L. Gao, J. Xue, Memory Coloring: A Compiler Approach for Scratchpad Memory Management. In PACT '05: Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques,IEEE Computer Society, 2005, 329-338.
    [100] Intel, Intel(R) Architecture Software Developer's Manual, Volume 2: Instruction Set Reference Manual,http://download.intel.com/design/intarch/manuals/24319101.pdf.
    [101] J. Fabri, Automatic Storage Optimization. In SIGPLAN 79: Proceedings of the SIGPLAN symposium on Compiler construction, 1979, 83-91.
    [102] M.R. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W. H. Freeman & Co., 1979.
    [103] H.A. Kierstead, The Linearity of First-fit Coloring of Interval Graphs, SIAM J.Discrete Math 1,1988:526-530.
    [104] H.A. Kierstead, A polynomial time approximation algorithm for Dynamic Storage Allocation, Discrete Math., 1991, 87(2-3):231-237.
    [105] J. Gergov, Approximation Algorithms for Dynamic Storage Allocations. In ESA'96: Proceedings of the Fourth Annual European Symposium on Algorithms,Springer-Verlag, 1996, 52-61.
    [106] J. Gergov, Algorithms for compile-time memory optimization. In SODA '99:Proceedings of the 10th annual ACM-SIAM symposium on Discrete algorithms,Society for Industrial and Applied Mathematics, 1999, 907-908.
    [107] A.L. Buchsbaum, H. Karloff, C. Kenyon, N. Reingold, M. Thorup, OPT versus LOAD in dynamic storage allocation. In STOC '03: Proceedings of the 35th annual ACM symposium on Theory of computing, ACM Press, 2003, 556-564.
    [108] G. Confessore, P. Dell'Olmo, S. Giordani, An approximation result for the interval coloring problem on claw-free chordal graphs, Discrete Appl. Math., 2002,120(1-3):73-90.
    [109] S.V. Pemmaraju, S. Penumatcha, R. Raman, Approximating interval coloring and max-coloring in chordal graphs, J. Exp. Algorithmics, 2005,10 :2.8.
    [110] C. Andersson, Register Allocation by Optimal Graph Coloring. In CC'03:Proceedings of the 12th International Conference on Compiler Construction,Springer-Verlag, 2003,
    [111] F.M. Pereira, J. Palsberg, Register Allocation via Coloring of Chordal Graphs.In APLAS'05: Proceedings of the 3rd Asia Symposium on Programming Language and Systems, 2005, 315-329.
    [112] S. Hack, D. Grand, G. Goos, Register Allocation for Programs in ssa-form. In CC'06: Proceedings of the 15th International Conference on Compiler Construction, Springer-Verlag, 2006,

    [113] L. Li, Q.H. Nguyen, J. Xue, Scratchpad Allocation for Data Aggregates in Supperperfect Graphs, ACM SIGPLAN NOT., 2007,42(7):207-216.
    [114] Govindarajan, S. Rengarajan, Buffer Allocation in Regular Dataflow Networks:An approach Based on Coloring Circular-arc Graphs. In Proceedings of the 2nd International Conference on High Performance Computing, 1996,
    [115] S.V. Pemmaraju, R. Raman, Approximation Algorithms for the Mac-Coloring Problem. In International Colloquium on Automata, Language and Computation,2005, 1064-1075.
    [116] J. Mauro, R. McDougall, Solaris Internals Core Kernel Architecture, Sun Microsystems Press.
    [117] J. Bonwick, The Slab Allocator: An Object-Caching Kernel Memory Allocator.In Proceedings of USENIX Conference, 1994, 87-98.
    [118] C.G. Nevill-Manning, I.H. Witten., Linear-time, incremental hierarchy inference for compression. In Proceedings of the Data Compression Conference, 1997.
    [119] J.R. Larus, Whole program paths. In PLDI'99:Proceedings of the ACM SIGPLAN '99 Conference on Programming Language Design and Implementation, 1999,259-269.
    [120] T.M. Chilimbi, Efficient representations and abstractions for quantifying and exploiting data reference locality. In PLDI '01: Proceedings of the ACM SIGPLAN 2001 conference on Programming language design and implementation, ACM Press, 2001, 191-202.
    [121] T. Sherwood, B. Calder, J. Emer, Reducing cache misses using hardware and software page placement. In ICS '99: Proceedings of the 13th international conference on Supercomputing, ACM Press, 1999,155-164.
    [122] NASA, The NAS Parallel Benchmarks Version 3.2,http://www.nas.nasa.gov/Software/NPB/.

    [123] SPEC.ORG, SPEC2000 benchmarks, http://www.spec.org/cpu2000/.
    [124] L.A. Belady, A Study of Replacement Algorithms for a Virtual-Storage Computer, IBM Systems Journal, 1966, 5(2):78-101.
    [125] S. Rixner, Stream Processor Architecture, Kluwer Academic Publishers Group,2002.
    [126] M. Taylor, J. Kim, J. Miller, et al, The Raw Microprocessor: A Computational Fabric for Software Circuits and General Purpose Programs, IEEE Micro, 2002,22(2):25-35.
    [127] M.I. Gordon, W. Thies, M. Karczmarek, et al, A stream compiler for communication-exposed architectures. In ASPLOS-X: Proceedings of the 10th international conference on Architectural support for programming languages and operating systems, ACM Press, 2002,291-303.
    [128] J.D. Owens, S. Rixner, U.J. Kapasi, et al, Media Processing Applications on the Imagine Stream Processor. In ICCD'02: Proceedings of 20th IEEE International Conference on Computer Design, 2002,295-302.
    [129] J.D. Owens, Computer Graphics on a Stream Architecture, PhD thesis, Stanford,2002.
    [130] T. Hoare, Communicating Sequential Processes, Communications of the ACM,1978, 8(21):666-677.

    [131] R. Stephens, A Survey of Stream Processing, Report, 1995.
    [132] W. Thies, M. Karczmarek, S. Amarasinghe, Streamlt: A Language for Streaming Applications. In Proceedings of the International Conference on Compiler Construction, 2002,179-196.

    [133] I. Buck, Brook Spec v0.2, report, Stanford University, 2003.
    [134] Evolving the High Performance Computing and Communications Initiative to Support the Nation's Information Infrastructure, National Academy Press,Washington,D.C., 1995.
    [135] M.E. Wolf, M.S. Lam, A Data Locality Optimizing Algorithm. In Proc. of the SIGPLAN'91 Symp. on Programming Language Design and Implementation,1991.
    [136] M.Wolf, High Performance Compilers for Parallel Computing, Addison-Wesley Publishing Company, 1996.
    [137] S.Coleman, K.Mckinley, Tile size selection using cache organization and data layout. In PLDI '95: Proceedings of the ACM SIGPLAN 1995 conference on Programming language design and implementation, 1995.
    [138] R.Panda, H.Nakamura, N.Dutt, A.Nicolau, Augmenting loop tiling with data alignment for improved cache performance, IEEE Transactions on Computers,1999,48(2):142-149.

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700