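Study of Code Optimizing Method Based on Parallel Functional Units

Chinese title: 基于并行处理单元的代码优化方法研究
Author: 邱春武 (Qiu Chunwu)
Degree level: Master's
Discipline: Computer Science and Technology
Keywords: VLIW; DSP; code optimization; parallel units; cluster assignment; scheduling
Degree year: 2008
Advisor: 余文 (Yu Wen)
Discipline code: 081203
Degree-granting institution: Beijing University of Posts and Telecommunications (北京邮电大学)
Thesis submission date: 2008-03-01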

Abstract

Compared with traditional DSPs, modern DSPs rely more heavily on instruction-level parallelism (ILP) techniques to improve performance. The DSP discussed in this thesis uses a clustered VLIW architecture and can execute multiple operations in a single clock cycle. We first discuss how to construct a code optimizer for this kind of DSP, and then present a concrete implementation for the TI TMS320DM642.
     The VLIW DSP code optimizer is built on the LCC compiler framework. First, LCC serves as the compiler front end and produces intermediate code. Next, template matching over the intermediate code selects the corresponding target machine instructions. Finally, cluster assignment and scheduling are performed, with registers and functional units allocated at the same time, yielding optimized parallel assembly code.
     We customize a machine specification and a machine description for the VLIW DSP and write the code generation rules as an iburg specification, from which iburg automatically generates the instruction selection part of the code optimizer. This improves the retargetability of the VLIW DSP code optimizer.
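To make the idea of rule-driven instruction selection concrete, the following is a minimal sketch of bottom-up, cost-based tree matching in the style of an iburg-generated matcher. It is an illustration only, not the thesis's rule set: iburg itself emits C, whereas this sketch is Python, and the operator names (`REG`, `CNST`, `ADD`, `MPY`), the nonterminals (`reg`, `imm`), the costs, and the rule names are all hypothetical. Chain rules, which real iburg specifications support, are left out for brevity.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    op: str                                   # operator of the intermediate-code tree node
    kids: tuple = ()                          # child nodes
    best: dict = field(default_factory=dict)  # nonterminal -> (cost, rule name, kid nonterminals)

# Hypothetical rules: (result nonterminal, (operator, kid nonterminals...), cost, rule name).
RULES = [
    ("reg", ("REG",),              0, "use-reg"),
    ("reg", ("CNST",),             1, "load-const"),
    ("imm", ("CNST",),             0, "immediate"),
    ("reg", ("ADD", "reg", "reg"), 1, "add-reg-reg"),
    ("reg", ("ADD", "reg", "imm"), 1, "add-reg-imm"),
    ("reg", ("MPY", "reg", "reg"), 2, "mpy-reg-reg"),
]

def label(node):
    """Bottom-up pass: record the cheapest rule for every nonterminal at each node."""
    for kid in node.kids:
        label(kid)
    node.best = {}
    for nt, pattern, cost, name in RULES:
        op, kid_nts = pattern[0], pattern[1:]
        if op != node.op or len(kid_nts) != len(node.kids):
            continue
        # Every child must be derivable as the nonterminal the pattern asks for.
        if any(k_nt not in kid.best for kid, k_nt in zip(node.kids, kid_nts)):
            continue
        total = cost + sum(kid.best[k_nt][0] for kid, k_nt in zip(node.kids, kid_nts))
        if nt not in node.best or total < node.best[nt][0]:
            node.best[nt] = (total, name, kid_nts)

def emit(node, nt="reg", out=None):
    """Top-down pass: list the chosen rules in post-order (children before parent)."""
    out = [] if out is None else out
    _, name, kid_nts = node.best[nt]
    for kid, k_nt in zip(node.kids, kid_nts):
        emit(kid, k_nt, out)
    out.append(name)
    return out

# Usage: select rules for the tree ADD(MPY(REG, REG), CNST).
tree = Node("ADD", (Node("MPY", (Node("REG"), Node("REG"))), Node("CNST")))
label(tree)
print(tree.best["reg"][0], emit(tree))
```

Because the rules live in a table separate from the matcher, retargeting in this sketch amounts to swapping the table, which mirrors the retargetability argument made above.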
     A prominent feature of the VLIW DSP architecture is clustering. Correspondingly, an important step in code generation is cluster assignment, which maps each operation and its operands to an appropriate cluster. Cluster assignment should keep the functional units of every cluster well utilized while reducing data transfers between clusters. We discuss common cluster assignment algorithms and the list scheduling algorithm, and then present an implementation of the Unified Assign and Schedule (UAS) algorithm for the VLIW DSP. In UAS, cluster assignment and scheduling proceed together: when an operation is scheduled, the operation and its operands are assigned to an appropriate cluster at the same time. Experiments show that the code optimization method presented in this thesis achieves good optimization results on commonly used DSP algorithms.
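The unified approach can be sketched as a list scheduler whose inner loop also picks the cluster. The code below is a simplified illustration under stated assumptions, not the thesis's implementation: every operation has a latency of one cycle, each cluster offers a fixed number of interchangeable functional units per cycle, an operand produced on another cluster costs an extra `copy_latency` cycles, and bus limits and register allocation are ignored. The function name `uas_schedule` and all of its parameters are hypothetical. The dependence graph is assumed to be acyclic, and every predecessor must itself appear in `ops`.

```python
def uas_schedule(ops, deps, n_clusters=2, units_per_cluster=1, copy_latency=1):
    """Assign each operation a (cycle, cluster) pair in a single pass.

    ops:  operation names in scheduling priority order.
    deps: dict mapping an operation to the list of operations it depends on.
    """
    placed = {}            # op -> (cycle, cluster)
    used = {}              # (cycle, cluster) -> functional units already taken
    remaining = list(ops)
    cycle = 0
    while remaining:
        for op in list(remaining):
            preds = deps.get(op, [])
            if any(p not in placed for p in preds):
                continue   # some operand has not been scheduled yet

            def local_operands(c):
                # Number of operands that already live on cluster c.
                return sum(1 for p in preds if placed[p][1] == c)

            # Cluster priority: most local operands first, then least loaded.
            order = sorted(
                range(n_clusters),
                key=lambda c: (-local_operands(c), used.get((cycle, c), 0), c),
            )
            for c in order:
                if used.get((cycle, c), 0) >= units_per_cluster:
                    continue   # no free functional unit on this cluster
                # Earliest cycle at which all operands are available on cluster c.
                ready = max(
                    (placed[p][0] + 1 + (0 if placed[p][1] == c else copy_latency)
                     for p in preds),
                    default=0,
                )
                if ready <= cycle:
                    placed[op] = (cycle, c)
                    used[(cycle, c)] = used.get((cycle, c), 0) + 1
                    remaining.remove(op)
                    break
        cycle += 1
    return placed

# Usage: c = f(a, b), e = g(c, d) on two clusters with one unit each.
deps = {"c": ["a", "b"], "e": ["c", "d"]}
schedule = uas_schedule(["a", "b", "c", "d", "e"], deps)
for op, (cyc, cluster) in sorted(schedule.items(), key=lambda kv: kv[1]):
    print(f"cycle {cyc}, cluster {cluster}: {op}")
```

The point of the design is that the cluster decision sees the actual state of the partial schedule (free units in the current cycle and where each operand lives), which is information a separate cluster assignment phase run before scheduling would not have.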
