Research on Key Techniques of Conditional Branch Instruction Processing in Processors



English title: Research on Key Techniques of Conditional Branch Processing
Author: Chen Chen (陈晨)
Degree level: Doctoral
Discipline: Circuits and Systems
Keywords: conditional branch ; branch prediction ; peaks of misprediction ; dynamic adaptive filtering mechanism ; multiple-level buffering ; adaptive prediction granularity ; loop accelerating processing ; transferring PC across pipelines
Degree year: 2013
Advisor: Yan Xiaolang (严晓浪)
Discipline code: 080902
Degree-granting institution: Zhejiang University
Thesis submission date: 2013-04-01

Abstract

As applications demand ever-higher processor performance, techniques such as superscalar issue, very deep pipelines, and speculative execution are used in processor designs to increase instruction-level parallelism. Conditional branch instructions, which combine conditional execution with control of program flow, work against that parallelism, so efficient conditional branch processing is a prerequisite for these techniques to reach their potential. This thesis studies several performance-oriented key techniques for conditional branch processing. Its main contributions are as follows:
     1. Branch prediction based on dynamic polarity transformation. Based on a study of the temporal locality of branch mispredictions, a branch prediction method based on dynamic transformation of the prediction polarity is proposed. The method dynamically monitors the original misprediction rate, that is, the rate before any polarity transformation, and identifies the periods in which this rate exceeds a threshold. These periods are called peaks of misprediction. During a peak, the polarity of the original prediction is transformed to produce the final prediction, which keeps the final misprediction rate low and improves overall prediction accuracy. The method addresses problems, such as branch jitter, that traditional prediction methods based on branch aliasing cannot solve.
     2. Multi-layered filter (MLF) branch prediction. Based on a study of the spatial locality of branch mispredictions, a multi-layered filter (MLF) branch prediction method is proposed. MLF prediction divides the prediction space into multiple layers and dynamically monitors the misprediction rate of each layer. The few misprediction-prone branches concentrated in a layer are filtered to the next layer, whose sub-predictor is dedicated to them. This lowers the misprediction rate of each layer and improves both overall prediction accuracy and hardware efficiency. The filtering mechanism addresses the low resource utilization of traditional multiple-bank branch predictors, in which every bank has to handle all conditional branches.
     3. Multiple-level buffered parallel branch prediction and adaptive prediction granularity parallel branch prediction. A multiple-level buffered parallel (MLBP) branch prediction method is proposed first. MLBP prediction accesses the predictor in cycles that contain no conditional branch, prefetches the prediction information of upcoming branches, and buffers it at multiple levels. When several conditional branches are fetched in the same cycle, the corresponding buffered predictions are allocated to them. This method achieves high-accuracy parallel branch prediction at instruction fetch bandwidths of eight or less. A second method based on adaptive prediction granularity is then proposed. According to the instruction fetch bandwidth and the behavior of the branches, it adaptively packs a varying number of conditional branches into an instruction package, uses the package as the prediction granularity, and maintains branch history per package. This method achieves high-accuracy parallel branch prediction at any instruction fetch bandwidth.
     4. Loop acceleration based on reuse of the decode buffer and PC transmission across pipelines. Based on the characteristics of loop bodies, a loop acceleration method is proposed that relies on reuse of the decode buffer and PC transmission across pipelines. Transferring PC-related information across pipelines reduces the information that the decode buffer must hold, which makes a decode buffer with many entries feasible. The method then reuses this buffer to process loops: when a loop conditional branch is encountered, a loop body area is created in the decode buffer, and during loop execution the loop body instructions are sent to the execution units from the buffer, which speeds up loop execution. The method further adopts a self-circulation wide issue mechanism to recover the performance lost to loop body instruction alignment, loop joining, and cache output bandwidth limits.
     The techniques proposed in this thesis are of both theoretical and practical value for improving the performance of conditional branch processing.
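The polarity transformation idea in contribution 1 can be illustrated with a small software model. The following is only a sketch under stated assumptions, not the thesis's design: the base predictor (a table of 2-bit saturating counters), the sliding window of 16 branches, the 0.5 threshold, and every class and function name are hypothetical choices made for illustration. It shows the three steps the abstract describes: track the raw (untransformed) misprediction rate, detect a peak when that rate exceeds a threshold, and invert the raw prediction while the peak lasts.

```python
from collections import deque


class BimodalPredictor:
    """Hypothetical base predictor: 2-bit saturating counters indexed by low PC bits."""

    def __init__(self, index_bits=10):
        self.mask = (1 << index_bits) - 1
        self.table = [1] * (1 << index_bits)  # every counter starts weakly not-taken

    def predict(self, pc):
        return self.table[pc & self.mask] >= 2  # True means "predict taken"

    def update(self, pc, taken):
        i = pc & self.mask
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)


class PolarityTransformingPredictor:
    """Inverts the base prediction while the raw misprediction rate is above a threshold.

    The window size and threshold are illustrative assumptions.
    """

    def __init__(self, base, window=16, threshold=0.5):
        self.base = base
        self.window = window
        self.threshold = threshold
        self.raw_misses = deque(maxlen=window)  # outcomes of the raw predictions only

    def in_peak(self):
        # Declare a peak only once the window is full, to avoid reacting to noise.
        if len(self.raw_misses) < self.window:
            return False
        return sum(self.raw_misses) / self.window > self.threshold

    def predict(self, pc):
        raw = self.base.predict(pc)
        return (not raw) if self.in_peak() else raw

    def update(self, pc, taken):
        # The monitor always records the raw prediction's outcome, not the final one,
        # so the peak detector is unaffected by the inversion itself.
        raw = self.base.predict(pc)
        self.raw_misses.append(raw != taken)
        self.base.update(pc, taken)


def count_mispredictions(predictor, trace):
    """Run a trace of (pc, taken) pairs and count wrong final predictions."""
    misses = 0
    for pc, taken in trace:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses
```

In this toy model, a branch that strictly alternates between taken and not-taken defeats the bare 2-bit counter on every execution, whereas the wrapper stops mispredicting once its window has filled and the peak is detected. A branch that is always taken never triggers the inversion, so both predictors behave identically on it.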

