An L2 Cache Partitioning Mechanism for Multithreaded Array-Based Many-Core Processors

Original title: 一种多线程阵列众核处理器的二级Cache划分机制
Authors: CHEN Yi-fei (陈逸飞); ZHU Lei (朱蕾); LI Hong-liang (李宏亮)
Keywords: array-based many-core processor; simultaneous multithreading; shared L2 cache partitioning mechanism
Journal code: JSJK
Journal: Computer Engineering & Science (计算机工程与科学)
Affiliation: Jiangnan Institute of Computing Technology (江南计算技术研究所)
Publication date: 2019-03-15
Publisher: Computer Engineering & Science (计算机工程与科学)
Year: 2019
Volume and cumulative issue: v.41; No.291
Language: Chinese
Record ID: JSJK201903003
Page count: 9
Issue: 03
CN: 43-1258/TP
Pages: 20-28

Abstract

Array-based many-core processors are widely used in high-performance computing because of their high computational performance and energy efficiency. Processors for future high-performance computing systems must nonetheless overcome the severe "memory wall" challenge and the problem of coordinating cores. In a typical array-based many-core processor, most cores are single-threaded to reduce overhead, which places high demands on memory access. We introduce hardware simultaneous multithreading into the single-core structure. To address the low utilization of the L2 cache by a single multithreaded core that we observed in experiments, we propose a shared L2 cache partitioning mechanism (thread-based cache partitioning) for array-based many-core processors. Simulation results show that, with this partitioning mechanism, the L2 instruction cache miss rate falls by 18.59%, the L2 data cache miss rate falls by 6.60%, and overall CPI performance improves by 10.1%.

References

[1] Keckler S W, Dally W J, Khailany B, et al. GPUs and the future of parallel computing[J]. IEEE Micro, 2011, 31(5): 7-17.
[2] Saule E, Çatalyürek Ü V. An early evaluation of the scalability of graph algorithms on the Intel MIC architecture[C]//Proc of the 2012 International Parallel and Distributed Processing Symposium Workshops & PhD Forum, 2012: 1629-1639.
[3] Bouvier D, Sander B. Applying AMD's Kaveri APU for heterogeneous computing[C]//Proc of 2014 IEEE Hot Chips 26 Symposium, 2014: 1-42.
[4] Taylor M B, Kim J, Miller J, et al. The Raw microprocessor: A computational fabric for software circuits and general-purpose programs[J]. IEEE Micro, 2002, 22(2): 25-35.
[5] Dally W J, Balfour J, Black-Schaffer D, et al. Efficient embedded computing[J]. Computer, 2008, 41(7): 27-32.
[6] Wentzlaff D, Griffin P, Hoffmann H, et al. On-chip interconnection architecture of the Tile Processor[J]. IEEE Micro, 2007, 27(5): 15-31.
[7] Fan D, Zhang H, Wang D, et al. Godson-T: An efficient many-core processor exploring thread-level parallelism[J]. IEEE Micro, 2012, 32(2): 38-47.
[8] Olofsson A, Nordstrom T, Ul-Abdin Z. Kickstarting high-performance energy-efficient manycore architectures with Epiphany[C]//Proc of 2014 Asilomar Conference on Signals, Systems and Computers, 2014: 1719-1726.
[9] Dinechin B D D, Massas P G D, Lager G, et al. A distributed run-time environment for the Kalray MPPA®-256 integrated manycore processor[J]. Procedia Computer Science, 2013, 18: 1654-1663.
[10] Yoshifuji N, Sakamoto R, Nitadori K, et al. Implementation and evaluation of data-compression algorithms for irregular-grid iterative methods on the PEZY-SC processor[C]//Proc of the 6th Workshop on Irregular Applications: Architectures & Algorithms, 2017: 58-61.
[11] Tullsen D M, Eggers S J, Emer J S, et al. Exploiting choice: Instruction fetch and issue on an implementable simultaneous multithreading processor[C]//Proc of the 23rd Annual International Symposium on Computer Architecture, 1996: 192-202.
[12] Shah M, Barreh J, Brooks J, et al. UltraSPARC T2: A highly-threaded, power-efficient, SPARC SOC[C]//Proc of IEEE Asian Solid-State Circuits Conference, 2007: 22-25.
[13] Liu C, Sivasubramaniam A, Kandemir M. Organizing the last line of defense before hitting the memory wall for CMPs[C]//Proc of the 10th International Symposium on High Performance Computer Architecture, 2004: 176-185.
[14] Kim S, Chandra D, Solihin Y. Fair cache sharing and partitioning in a chip multiprocessor architecture[C]//Proc of the 13th International Conference on Parallel Architecture and Compilation Techniques, 2004: 111-122.
[15] Dybdahl H, Natvig L. A cache-partitioning aware replacement policy for chip multiprocessors[C]//Proc of International Conference on High Performance Computing, 2006: 22-34.
[16] Song Feng-long, Liu Zhi-yong, Fan Dong-rui, et al. An implicitly dynamic shared cache isolation in many-core architecture[J]. Chinese Journal of Computers, 2009, 32(10): 1896-1904. (in Chinese)
[17] Butko A, Garibotti R, Ost L, et al. Accuracy evaluation of GEM5 simulator system[C]//Proc of International Workshop on Reconfigurable Communication-Centric Systems-On-Chip, 2012: 1-7.
[18] Nair A A, John L K. Simulation points for SPEC CPU 2006[C]//Proc of IEEE International Conference on Computer Design, 2009: 397-403.
[19] Binkert N, Beckmann B, Black G, et al. GEM5 simulator: A modular platform for computer-system architecture research[EB/OL]. [2016-04-21]. http://www.gem5.org.