基于动态二进制翻译的龙芯虚拟机中数据预取优化研究

英文题名：Data Prefetching Optimization Research in Dynamic Binary Translation Based Loongson Virtual Machine
作者：罗琼程
论文级别：硕士
学科专业名称：计算机科学与技术
中文关键词：二进制翻译 ; 数据预取 ; 动态优化 ; 超级块 ; 龙芯
英文关键词：binary translation ; data prefetching ; dynamic optimization ; superblock ; loongson
学位年度：2009
导师：吴强
学科代码：081201
学位授予单位：湖南大学
论文提交日期：2009-04-25
答辩委员会主席：邝继顺

摘要

二进制翻译作为实现代码移植的一种软件手段,能将某一体系结构下的可执行二进制程序在没有其源代码的情况下翻译转换成能在其它体系结构下运行的二进制代码。动态二进制翻译就是边翻译边执行,并在翻译的过程中进行动态优化。随着微处理器技术、编译技术的发展,二进制翻译逐渐成为研究的一个热点方向,在虚拟化技术、分布式计算及信息安全等方面得到了广泛重视。当前,微处理器频率不断提高,而内存频率的提升进展缓慢,其性能差距越来越大,对内存的访问早成为制约程序性能的瓶颈。作为访存优化的一种重要方法,数据预取可以将随后使用的数据提前读进高速缓存,这样能有效隐藏访存延迟,提高程序性能。
     本文在动态二进制翻译系统中对数据预取优化进行研究。首先结合龙芯处理器的硬件特性,采用软件插桩方式收集应用程序的访存指令其执行周期及步长变化信息来识别发生Cache缺失的延迟指令,并依此进行分类,接着对程序中的热代码构造数据预取优化单元——超级块(SuperBlock),在此基础上实现了SuperBlock基本数据预取方案。最后,通过对SuperBlock进行数据流分析得出的寄存器定值引用关系,提出了基于访存指令地址计算分量列表等不同预取优化策略。
     通过在龙芯3号虚拟机上实验验证,SuperBlock构造在开销小于1%的情况下能够提高翻译后SPEC2000整点测试程序的平均性能达10%。虽然数据预取对于整点测试集没有明显的优化效果,但可以对翻译后浮点测试程序达到3.3%的性能加速比,而其SuperBlock构造及预取分析开销远小于0.5%。
As software means of code migration, binary translation can convert executable binaries in the absence of its sources from one instruction set architecture to the other. Dynamic binary translation is just-in-time translation, while at the same time, it can do dynamic optimization. With the progress of processor、compiler technology, binary translation has become a hot research direction and has got extensive attention in virtualization technology、distributed computing and information security, et al. Currently, the rate of improvement in microprocessor speed exceeds the rate of improvement in Dynamic Random Access Memory speed. Hence the increasing Processor-Memory Performance Gap is now the primary obstacle to improved computer system performance. Dynamic optimization is an important research subject in binary translation systems. As a way of memory optimization, data prefetching reads subsequent data into high speed caches ahead of time to hide memory access latency, which can improve application’s performance.
     In this thesis, data prefetching optimization is studied for the first time in a binary translation system. It combines Loongson processor’s hardware features and uses software instrumentation to collect program’s memory access latency information. Firstly, delinquent loads and their types are identified, data prefetching optimization unit—SuperBlocks are then constructed from frequently executed code blocks, after that, basic prefetching optimization is realized. Secondly, the data flow analysis is performed on SuperBlock to generate the RDUG(register define and use graph), and then some data prefetching schemes are proposed based on load instructions’address computing components.
     Experiments on the dynamic binary translation based Loongson-3 virtual machine show that, Superblock construction can achieve 10% improvement on average for translated SPEC2000 int programs while the overhead is less than 1%. Although data prefetching optimization has lettle effect on SPEC2000 int, it can improve the performance of SPEC2000 float programs by 3.3% on average with the analysis overhead far less than 0.5%.

引文

[1] Rosenblum M, Garfinkel T. Virtual Machine Monitors: Current Technology and Future Trends. IEEE Computer, 2005, 38(5): 39-47
    [2] Chen Yu, Ren Jie, Zhu Hui, et al. Dynamic Binary Translation and Optimization in a Whole System Emulator-SkyEye. ICPPW’06. Washington: IEEE Computer Society, 2006, 327-336
    [3] Bellard F. QEMU, a fast and portable dynamic translator. In: Proc of the Annual Conference on USENIX Annual Technical Conference. Anaheim, 2005, 41-46
    [4] Intel IA-64 Arichitecture Software Developer’s Manual Volume 1: IA-64 Application Architecture, Revision 1.1, July 2000
    [5] Leonid B, Tevi D, Orna E. IA-32 Execution Layer: a two-phase dynamic translator designed to support IA-32 applications on Itanium-based systems. In: Proc of the 36th International Symposium on Microarchitecture. Washington: IEEE Computer Society, 2003, 191-201
    [6] Cifuentes C, Malhotra V. Binary Translation: Static, Dynamic, Retargetable? In: Proc of International Conference on Software Maintenance. Monterey, 1996, 340-349
    [7] Bergh A. Keilman K, Magenheimer D, et al. HP3000 emulation on HP precision architecture computers. Hewlett-Packard Journal, 1987, 87-89
    [8] Anton C, Mark H, Ray H, et al. FX!32: a Profile-Directed Binary Translator. IEEE Micro, 1998, 18(2): 56-64
    [9] Hookway R J, Herdeg M A. Digital FX!32: Combining Emulation and Binary Translation. Digital Technical Journal, 1997, 9(1): 3-12
    [10] Hookway R. DIGITAL FX!32: running 32-Bit x86 applications on Alpha NT. http://www.usenix.org/publications/library/proceedings/usenix-nt97/full_papers/chernoff/chernoff.pdf, 2008-10-01
    [11] Ebcioglu K, Altman E. DAISY: Dynamic Compilation for 100% Architectural Compatibility. In: Proc of ISCA24. New Work: ACM, 1997, 26-37
    [12] Ebcioglu K. Execution-Based Scheduling for VLIW Architectures. In: Proc of the 5th International Euro-Par Conference on Parallel Processing. London, 1999, 1269-1280
    [13] Michael G. Dynamic and Transparent Binary Translation. IEEE Computer, 2000, 33(3): 54-59
    [14] Michael G, Erik A. Inherently Lower Complexity Architectures using Dynamic Optimization. http://www.research.ibm.com/vliw/Pdf/wced02.pdf, 2008-10-01
    [15] Alexander K. The Technology behind Crusoe Processor. Transmeta technology report, 2000, 3-12
    [16] Transmeta Corporation, http://www.transmeta.com/index2.html, 2008-10-01
    [17] Cifuentes C, Emmerik M V. UQBT: Adaptable Binary Translation at Low Cost. IEEE Computer, 2000, 33(3): 60-66
    [18] Cifuentes C, Emmerik M V, Ung D, et al. Preliminary Experiences with the Use of the UQBT Binary Translation Framework. In: Proc of the Workshop on Binary Translation. Los Alamitos, 1999, 12-22
    [19] Ung D, Cifuentes C. Optimising Hot Paths in a Dynamic Binary Translator. ACM SIGARCH Computer Architecture News, 2000, 29(1): 55-65
    [20] Andrews K, Sand D. Migrating a CISC computer family onto RISC via object code translation. ACM SIGPLAN Notices, 1992, 27(9): 213-222
    [21] Cindy Z, Thompson C. PA-RISC to IA-64: Transparent Execution, No Recompilation. IEEE Computer, 2000, 33(3): 47-52
    [22] Kevin S, Jack D. Retargetable and Reconfigurable Software Dynamic Translation. In: Proc of the international symposium on Code generation and optimization: feedback-directed and runtime optimization. Washington: IEEE Computer Society, 2003, 36-47
    [23] Kevin S, Jack D. Strata: A software dynamic translation infrastructure. Technical Report: CS-2001-17, Virginia: University of Virginia, 2001
    [24] Transitive Corp. http://www.transitive.com/index, 2008-10-01
    [25] Bala V, Duesterwald E, Banerjia S. Dynamo: A Transparent Dynamic Optimization System. In: Proc. of the ACM SIGPLAN 2000 conference on Programming language design and implementation. New Work: ACM, 2000, 1-12
    [26] Cramer T, Friedman, R. Compiling java just in time. IEEE Micro, 1997, 17(3): 36-43.
    [27] Byung-Sun Y, Soo-Mook M, Seongbae P, et al. LaTTe: A Java VM Just-in-Time Compiler with Fast and Efficient Register Allocation. In: Proc. of the 1999 International Conference on Parallel Architectures and Compilation Techniques. Washington: IEEE Computer Society, 1999, 128-138
    [28] Cierniak M, Lueh G, Stichnoth J. Practicing JUDO: Java Under Dynamic Optimizations. ACM SIGPLAN Notices, 2000, 35(5): 13-26
    [29]马湘宁,武成岗,唐锋等.二进制翻译中的标志位优化技术.计算机研究与发展, 2005, 42(2): 329-337
    [30]杨浩,唐锋,武成岗等.二进制翻译中的库函数处理.计算机研究与发展, 2006, 43(12): 2174-2179
    [31] Li Jianjun, Wu Chenggang, Hsu Wei-Chung, et al. An Evaluation of Misaligned Data Access Handling Mechanisms in Dynamic Binary Translation Systems. In: Proc of 7 th Annual IEEE/ACM International Symposium on Code Generation and Optimization.Seattle, WA, 2009
    [32] Nihar R M, Balakrishna V. The processor-memory bottleneck: problems and solutions. http://www.acm.org/crossroads/xrds5-3/pmgap.html, 2008-10-01
    [33] James E S, Ravi N.虚拟机-系统与进程的通用平台.北京:电子工业出版社, 2006
    [34] Barham P, Dragovic B, Fraser K, et al. Xen and the art of virtualization. In: Proc of the 19th ACM symposium on Operating systems principles. New Work: ACM, 2003, 164-177
    [35]中国科学院计算技术研究所,意法半导体公司.龙芯2E处理器用户手册, 2006
    [36] SPEC CPU2000, http://www.spec.org/cpu2000/, 2008-10-01
    [37] John L H. SPEC CPU2000: Measuring CPU Performance in the New Millennium. IEEE Computer, 2000, 33(7): 28-35
    [38] Wen-mei W H, Scott A M. The SuperBlock: An Effective Technique for VLIW and Superscalar Compilation. Instruction-level parallel processors, 1995, 234-253
    [39] Jiwei Lu, Howard Chen, Pen-Chung Yew et al. Design and Implementation of a Lightweight Dynamic Optimization System. Journal of Instruction-Level Parallelism, 2004, 1-24
    [40] Howard Chen, Wei-Chung Hsu, Jiwei Lu, et al. Dynamic trace selection using performance monitoring hardware sampling. In: Proc of the international symposium on Code generation and Optimization: feedback-directed and runtime optimization. Washington: IEEE Computer Society, 2003, 79-90
    [41] Vanderwie S P, Lilja D J. Data prefetch mechanisms. ACM Computing Survey, 2000, 32(2): 174-199
    [42] Emma PG, Hartstein A, Puzak T R, et al. Exploring the limits of prefetching. IBM Journal of Research and Development, 2005, 49(1): 127–144
    [43] Luk C K, Mowry T C. Compiler-Based Prefetching For Recursive Data Structures. ACM SIGOPS Operating Systems Review, 1996, 30(5): 222–233
    [44] Roth A, Sohi G S. Effective Jump-Pointer Prefetching for Linked Data Structures. ACM SIGARCH Computer Architecture News, 1999, 27(2): 111–121
    [45] Srinath S, Kim O M, Patt Y N. Feedback directed prefetching: Improving the performance and bandwidth-efficiency of hardware prefetchers. In: Proc of the 13th Int. Symp. on High-Performance Computer Architecture. Washington: IEEE Computer Society, 2007, 63-74
    [46] Yamamoto A, Tanaka Y, Ando H, et al. Data Prefetching and Address Pre-Calculation through Instruction Pre-Execution with Two-Step Physical Register Deallocation. In: Proc of the workshop on MEmory performance: DEaling with Applications, systems and architecture. New Work: ACM, 2007, 33-40
    [47] Bekerman1 M, Yoaz A. Early Load Address Resolution Via Register Tracking. In: Proc. of the 27th annual international symposium on Computer architecture. New Work: ACM, 2000, 306-315
    [48] Klaiber A C, Levy H M. An Architecture for Software-Controlled Data Prefetching. In: Proc of the 18th annual international symposium on Computer architecture. New Work: ACM, 1991, 43-53
    [49] Jiwei Lu, Howard Chen, Rao Fu, et al. The performance of runtime data Cache prefetching in a dynamic optimization system. In 36th Annual IEEE/ACM International Symposium on Microarchitecture. Washington: IEEE Computer Socienty, 2003, 180-190
    [50] Beyler J C, Clauss P. Performance Driven Data Cache Prefetching in a Dynamic Software Optimization System. In: Proc. of the 21st annual international conference on Supercomputing, New Work: ACM, 2007, 202-209
    [51] Chilimbi T M, Hirzel M. Dynamic Hot Data Stream Prefetching for General-Purpose Programs. ACM SIGPLAN Notices, 2002, 37(5): 199–209
    [52] Chilimbi T M. Efficient Representations and Abstractions for Quantifying and Exploiting Data Reference Locality. ACM SIGPLAN Notices, 2001, 36(5): 191-202

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700