Research on High-Level Synthesis of Application-Specific IP Cores in SoC (SoC中应用类IP核高级综合技术研究)

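English title: Research on High Level Synthesis of IP Core for Specific Applications
Author: Dong Yazhuo (董亚卓)
Degree level: Doctoral
Discipline: Computer Science and Technology
Chinese keywords: SoC ; IP核 ; 高级综合 ; 滑动窗口应用 ; 数据重用 ; 设计空间探索
English keywords: SoC ; IP core ; high level synthesis ; sliding-window operation ; data reuse ; design space exploration
Degree year: 2008
Advisor: Dou Yong (窦勇)
Discipline code: 081201
Degree-granting institution: National University of Defense Technology (国防科学技术大学)
Thesis submission date: 2008-04-01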

Abstract

In recent years, with the rapid advance of integrated circuit (IC) design and process technology, system-on-chip (SoC) design has been applied ever more widely and now reaches into many areas of electronic design. SoC has become the prevailing trend in very-large-scale integration (VLSI) design.
     The intellectual property (IP) core is the basis and kernel of SoC design. SoC designers reuse existing IP cores as much as possible and complete most of a design by assembling them like building blocks. Application-specific IP cores embody the innovation in an SoC, and designing them is also the key constraint on building an SoC quickly. High-level synthesis (HLS) of IP cores transforms a behavioral-level hardware description into a structural description, or even a layout description. By raising the level of abstraction, HLS frees designers from intricate low-level details so that they can concentrate on the system as a whole, which improves design efficiency and correctness and lowers design cost. Since it was first proposed, HLS has attracted great attention from both academia and industry, and it is expected to play an even more important role in future designs.
     This thesis focuses on one class of applications: sliding-window applications. Sliding-window operations are widely used in signal, image and video processing, and they are both data-intensive and computation-intensive. Because of their distinctive memory access pattern, they are the starting point for many HLS tools. Unfortunately, existing HLS systems still have various shortcomings in handling them: some lack an explicit architecture model, some do not fully exploit data reuse, some spend too many memory elements and registers to achieve data reuse, and some perform no design space exploration. Building on existing work, this thesis systematically studies HLS of IP cores for sliding-window applications, addressing the problems outlined below.
     To address the inherent characteristics of sliding-window operations and the shortcomings of existing architecture models, we first propose a parameterized three-level memory architecture for the IP core, from which the hardware framework for any sliding-window application can be generated automatically. The goal is to exploit the data reuse in sliding-window applications as fully as possible, thereby reducing the number of memory accesses and speeding up execution. The model uses a three-level memory hierarchy to exploit data reuse both within a loop level and across loop levels, and it uses shift (rotating) registers to keep the hardware simple. Its concrete structure is determined by a set of parameters whose values the compiler derives at compile time from the characteristics of the application. We propose parameter extraction algorithms for the different kinds of data reuse. Experimental results show that, compared with related work, our memory architecture uses relatively few memory elements and registers, reduces the number of execution clock cycles by a factor of 2.13 to 3.8, and raises the clock frequency from 69 MHz to more than 200 MHz.
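The effect of data reuse on memory traffic can be illustrated with a small software model. The sketch below is not the thesis's architecture: it assumes a k x k window sum over a 2-D image, models off-chip memory as the `image` list, and uses k-1 line buffers plus a k x k shift-register window as the two on-chip levels. The function names and the choice of a window sum are illustrative assumptions.

```python
from collections import deque


def naive_window_sums(image, k):
    """Re-read every pixel of every k x k window from external memory."""
    rows, cols = len(image), len(image[0])
    reads = 0
    out = []
    for r in range(rows - k + 1):
        out_row = []
        for c in range(cols - k + 1):
            total = 0
            for dr in range(k):
                for dc in range(k):
                    total += image[r + dr][c + dc]
                    reads += 1  # one external access per operand
            out_row.append(total)
        out.append(out_row)
    return out, reads


def buffered_window_sums(image, k):
    """Stream each pixel once through line buffers and a register window."""
    rows, cols = len(image), len(image[0])
    # On-chip level: k-1 line buffers holding the k-1 most recent rows.
    line_buffers = [[0] * cols for _ in range(k - 1)]
    # Register level: k shift registers of length k form the window.
    window = [deque([0] * k, maxlen=k) for _ in range(k)]
    reads = 0
    out = [[0] * (cols - k + 1) for _ in range(rows - k + 1)]
    for r in range(rows):
        for c in range(cols):
            pixel = image[r][c]  # the only external access for this pixel
            reads += 1
            # Column entering the window: older rows come from the buffers.
            column = [line_buffers[i][c] for i in range(k - 1)] + [pixel]
            for i in range(k):
                window[i].append(column[i])  # shift by one position
            # Age the line buffers at this column (reuse across rows).
            for i in range(k - 2):
                line_buffers[i][c] = line_buffers[i + 1][c]
            if k > 1:
                line_buffers[k - 2][c] = pixel
            # The window is valid once k rows and k columns have streamed in.
            if r >= k - 1 and c >= k - 1:
                out[r - k + 1][c - k + 1] = sum(sum(reg) for reg in window)
    return out, reads
```

In this model the shift registers capture reuse between horizontally adjacent windows and the line buffers capture reuse between vertically adjacent windows, so the external read count drops from one per operand to one per pixel while the results stay identical.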
     Based on the parameterized memory architecture, we study the automatic generation of the RTL hardware description, with the goal of generating synthesizable Verilog code for the IP core automatically. The work has three parts: automatic generation of the control state machine, automatic generation of the computation pipeline, and generation of the top-level wrapper module. First, the compiler partitions the source program into a control part and a computation part. The control part is analyzed in the compiler to obtain loop information (the initial value, final value and step of each loop) and data reuse information, and the controller generation algorithm presented in this thesis uses this information to generate the control state machine automatically. The computation part goes through data structure definition and dependence analysis in the compiler, which outputs a data flow graph description. The datapath is then partitioned into pipeline stages and expressed in a new intermediate representation (IR), and finally the corresponding operation-unit IP functions are invoked to generate the computation pipeline. The wrapper module integrates the controller, the computation pipeline, the RAM modules and so on, producing the RTL hardware description of the IP core. This approach avoids the complexity and inefficiency of manual mapping, and the results it produces are comparatively well optimized.
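The abstract does not spell out how the data flow graph is cut into pipeline stages. As a minimal sketch of one common approach, the code below assigns each operation to the earliest possible stage (ASAP levelization) and counts the balancing registers needed on edges that skip stages. The graph encoding (operation name mapped to a list of predecessor names) and both function names are assumptions made for illustration.

```python
def asap_stages(dfg):
    """Map each operation to its earliest pipeline stage.

    dfg maps an operation name to the list of operations it depends on;
    inputs have an empty list and land in stage 0.
    """
    stage = {}

    def visit(op, active):
        if op in stage:
            return stage[op]
        if op in active:
            raise ValueError(f"dependence cycle through {op!r}")
        active.add(op)
        preds = dfg[op]
        level = 0 if not preds else 1 + max(visit(p, active) for p in preds)
        active.discard(op)
        stage[op] = level
        return level

    for op in dfg:
        visit(op, set())
    return stage


def balancing_registers(dfg, stage):
    """Count the delay registers needed on each edge that skips stages."""
    return {
        (pred, op): stage[op] - stage[pred] - 1
        for op, preds in dfg.items()
        for pred in preds
        if stage[op] - stage[pred] > 1
    }
```

For a body such as `s2 = (a*k + b*k) + c`, the input `c` is produced in stage 0 but consumed in stage 3, so its edge needs two delay registers to stay aligned with the partial sum.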
     We then study design space exploration for two situations, depending on whether on-chip resources are sufficient. For the case where on-chip resources are sufficient, we present a design space exploration method based on a hardware pipeline structure; the goal is to use the on-chip resources fully, increase parallelism, and reduce the number of execution clock cycles. Before the program is loaded onto the target board, the method derives three upper bounds from the area constraint (measured by the number of on-chip logic operation units), the memory bandwidth constraint and the on-chip memory constraint, and from them determines a hardware structure that makes full use of the available resources. When on-chip resources are to spare, loops are unrolled as far as possible to increase parallelism. When area is available but on-chip memory is insufficient, the input array is partitioned horizontally into blocks and the data within each block is processed in a pipelined fashion, so as to minimize repeated accesses to off-chip memory. Experiments show that the method raises on-chip resource utilization above 85%, and that the array partitioning method reduces the number of memory accesses by 2% to 20% compared with related work.
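The three upper bounds can be illustrated with a toy cost model. The sketch below assumes that each unrolled copy of the loop body needs a fixed number of operation units, a fixed number of fresh input words per cycle, and a fixed amount of on-chip storage on top of a shared base; the unroll factor is then the smallest of the three resulting bounds. The class names, field names and the linear cost model are hypothetical and are not taken from the thesis.

```python
from dataclasses import dataclass


@dataclass
class Resources:
    logic_units: int      # available operation units (proxy for area)
    words_per_cycle: int  # off-chip memory bandwidth
    onchip_words: int     # on-chip storage capacity


@dataclass
class LoopBody:
    units_per_copy: int         # operation units per unrolled copy
    new_words_per_copy: int     # fresh input words per copy per cycle
    buffer_words_base: int      # storage shared by all copies
    buffer_words_per_copy: int  # extra storage per unrolled copy


def max_unroll(res, body):
    """Return (unroll factor, limiting resource, all three bounds)."""
    bounds = {
        "area": res.logic_units // body.units_per_copy,
        "bandwidth": res.words_per_cycle // body.new_words_per_copy,
        "memory": (res.onchip_words - body.buffer_words_base)
        // body.buffer_words_per_copy,
    }
    limiting = min(bounds, key=bounds.get)
    # A factor of 0 means even one copy does not fit, which is the case
    # where the input array would have to be partitioned into blocks.
    return max(bounds[limiting], 0), limiting, bounds
```

With 64 operation units, 4 words per cycle and 2048 words of on-chip storage, a body that needs 9 units, 1 word and 96 words per copy over a 1024-word base is limited by bandwidth to an unroll factor of 4 in this model.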
     Some large applications contain many loop nests, and mapping all of them onto the target chip at once may be impractical because of the limited on-chip area. Designing a dedicated IP core for every loop nest is clearly not practical either. For this resource-constrained case, we present a parameterized pipeline template that is shared by all loop nests in a given application; the loop nests are mapped onto the template and executed on it one after another. The number of functional units (FUs) is decided from the needs of the application and the available on-chip resources. Based on iterative modulo scheduling from software pipelining and on the ShiftQ architecture model, we schedule the instructions of each loop nest and automatically generate the registers that hold intermediate results. Experiments show that, for each loop nest, the template achieves an execution cycle count comparable to that of dedicated hardware, while removing the need to design a dedicated IP core for every loop nest, which reduces design complexity and shortens the design cycle.
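To make the modulo-scheduling idea concrete, the following sketch schedules one loop body onto a fixed set of functional units. It is a deliberately simplified version: it assumes fully pipelined units, no loop-carried dependences, and a greedy placement without the eviction and backtracking of full iterative modulo scheduling, and it does not model ShiftQ register generation. The operation encoding `(fu_type, latency, predecessors)` and the function names are assumptions.

```python
from graphlib import TopologicalSorter
from math import ceil


def res_mii(ops, units):
    """Resource-constrained lower bound on the initiation interval (II)."""
    demand = {}
    for fu, _latency, _preds in ops.values():
        demand[fu] = demand.get(fu, 0) + 1
    return max(ceil(n / units[fu]) for fu, n in demand.items())


def modulo_schedule(ops, units, max_ii=64):
    """Return (II, start cycle per operation) for an acyclic loop body.

    ops maps a name to (fu_type, latency, list of predecessor names);
    units maps fu_type to the number of available units.
    """
    order = list(TopologicalSorter({n: ops[n][2] for n in ops}).static_order())
    for ii in range(res_mii(ops, units), max_ii + 1):
        # Modulo reservation table: usage of each unit type per slot mod II.
        table = {fu: [0] * ii for fu in units}
        start = {}
        for name in order:
            fu, _latency, preds = ops[name]
            earliest = max((start[p] + ops[p][1] for p in preds), default=0)
            # Trying II consecutive cycles covers every slot of the table.
            slot = next(
                (t for t in range(earliest, earliest + ii)
                 if table[fu][t % ii] < units[fu]),
                None,
            )
            if slot is None:
                break  # this II does not fit, try the next one
            table[fu][slot % ii] += 1
            start[name] = slot
        else:
            return ii, start
    raise ValueError("no schedule found up to max_ii")
```

Because a new iteration starts every II cycles, two operations that use the same unit type may not occupy the same slot modulo II; the reservation table enforces exactly that, which is how one shared set of units can serve every loop nest in turn.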
     In summary, this thesis studies HLS of IP cores for sliding-window applications and presents effective solutions to several key problems: the memory architecture model, automatic generation of the RTL hardware description, and design space exploration under both sufficient and insufficient on-chip resources. The work has theoretical significance and practical value for advancing research on, and the practical use of, HLS of application-specific IP cores.

