Region-Based Compilation Techniques and Stacked Register Optimization

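Author: Liu Yang (刘旸)
Degree level: Doctoral
Discipline: Computer Architecture
Keywords: Region; Single Entry Multiple Exits Region; Multiple Entries Multiple Exits Region; Register Stack Engine; Register Stack Frame
Degree year: 2003
Advisor: Zhaoqing Zhang (张兆庆)
Discipline code: 081201
Degree-granting institution: Graduate School of the Chinese Academy of Sciences (Institute of Computing Technology)
Thesis submission date: 2003-05-01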

Abstract

To improve instruction-level parallelism, a compiler must perform a large number of optimizations, yet complex and aggressive optimization algorithms consume considerable compilation time and resources. To reduce compilation time and give researchers a flexible research platform without compromising compilation performance, this thesis proposes a flexible region formation infrastructure. During this work we also found that RSE (Register Stack Engine) overhead accounts for a large share of the running time of some programs; to address this, we propose a method that effectively reduces RSE overhead and improves code performance.
     The main contributions of this thesis are:
     1. We propose a new region formation infrastructure that serves as the uniform compilation unit across the compiler's optimization phases. The infrastructure is hierarchical, and the shape and size of regions can be controlled flexibly through parameters to suit each optimization phase. Every basic block in the control flow graph is contained in at least one region, and regions are organized as a tree.
     2. We propose a Single Entry Multiple Exits region formation algorithm parameterized by a maximum tail duplication ratio and a minimum exit probability. Experiments show that the regions it forms adapt well to different optimization phases and provide ample optimization opportunities.
     3. We introduce the concept of region attributes. Attributes can be set, observed, maintained, and cleared in different optimization phases of the compiler, and they are important for preserving the effects of optimizations on a region from one phase to the next.
     4. We propose a cost model for RSE overhead and, based on this model, formulate the inter-procedural stacked register quota assignment problem.
     5. Based on this quantitative cost model, we give an inter-procedural stacked register allocation algorithm. The algorithm effectively reduces the total memory access overhead and thereby improves program execution performance.
     6. We propose a register allocation optimization for rematerializable live ranges. When the total execution frequency of the basic blocks in which such a live range is defined is greater than the total execution frequency of the basic blocks in which it is used, the live range need not be allocated a register: spilling moves the work from the definition points into the basic blocks of the uses, so the recomputation executes less often and execution time is reduced.
     7. All of the above algorithms were implemented in the ORC compiler and evaluated on the Spec2000Int programs, comparing the optimization results of region-based compilation with those of function-based compilation. The results show that region formation greatly reduces compilation time while also improving execution performance. The inter-procedural register allocation algorithm yields a significant performance gain for programs with high RSE overhead; for programs whose RSE overhead is not pronounced, performance is unchanged or slightly improved. The optimization of rematerializable live ranges noticeably improves the performance of Perlbmk. Overall, program execution performance improves substantially. Finally, we discuss extensions of these algorithms and identify problems for future research.
     As processor clock frequencies increase, memory optimization will inevitably receive growing attention from researchers.
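The decision rule in contribution 6 can be illustrated with a minimal sketch. It is not the ORC implementation: the `LiveRange` structure, the `prefer_spill` function, and the block frequencies below are hypothetical names and made-up numbers chosen only to show the comparison between definition frequency and use frequency.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class LiveRange:
    """A rematerializable live range (hypothetical structure for illustration)."""
    name: str
    def_blocks: List[str]  # basic blocks that contain a definition
    use_blocks: List[str]  # basic blocks that contain a use


def total_freq(blocks: List[str], freq: Dict[str, float]) -> float:
    """Sum the profiled execution frequencies of the given basic blocks."""
    return sum(freq[b] for b in blocks)


def prefer_spill(lr: LiveRange, freq: Dict[str, float]) -> bool:
    """Return True when the live range should be left without a register.

    Criterion from the abstract: if the definitions execute more often than
    the uses, recomputing the value in the use blocks is cheaper than
    computing it at every definition and holding it in a register.
    """
    return total_freq(lr.def_blocks, freq) > total_freq(lr.use_blocks, freq)


# Made-up profile data: one value is defined in a hot loop body but used
# only on a rarely taken exit path; the other is defined once and used often.
freq = {"loop_body": 1000.0, "cold_exit": 3.0, "entry": 1.0, "hot_use": 500.0}

hot_def = LiveRange("addr_of_table", def_blocks=["loop_body"], use_blocks=["cold_exit"])
cold_def = LiveRange("const_mask", def_blocks=["entry"], use_blocks=["hot_use"])

print(prefer_spill(hot_def, freq))   # True: definitions run far more often than uses
print(prefer_spill(cold_def, freq))  # False: keeping the value in a register pays off
```

In a real allocator this test would be only one input to the spill decision, alongside register pressure and the cost of the rematerializing instruction itself.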

