Research on an Area- and Bandwidth-Optimized Programmable Shader Architecture for Embedded GPUs

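English title: An Area and Bandwidth Efficient Programmable Shader Architecture for Embedded Graphics Processing Units
Author: 常轶松 (Chang Yisong)
Thesis level: Doctoral
Discipline: Computer Application Technology
Keywords (translated from Chinese): embedded GPU programmable shader; system-level simulation platform; transport triggered architecture; vertex cache
Keywords (English): Embedded Graphics Processing Unit (GPU); Programmable Shader; System-level Simulation Platform; Transport Triggered Architecture; Vertex Cache
Degree year: 2013
Advisor: 孙济洲 (Sun Jizhou)
Discipline code: 081203
Degree-granting institution: Tianjin University
Submission date: 2013-06-01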

Abstract

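With continuing advances in VLSI process technology and growing application demand, integrating an embedded GPU built on multiple unified shaders into a system-on-chip has become an important trend for high-end mobile devices. Because chip area is tightly constrained, however, an embedded GPU can accommodate only a very limited number of programmable shader cores. The architecture must therefore raise the computing performance of each shader while keeping its area overhead small. In addition, an embedded GPU frequently accesses off-chip graphics data during rendering, which results in very high bus bandwidth demand and increases system power consumption. Optimizing the logic area and data access bandwidth of programmable shaders has thus become an important direction in embedded GPU architecture research. To address these problems, this dissertation studies a system-level modeling method for multi-core embedded GPUs, an area-optimized functional-unit datapath and architecture for a single shader, and a bandwidth-optimized vertex cache structure for multiple shaders, providing a theoretical and technical basis for the future research and design of multi-core embedded GPU architectures.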
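     First, this dissertation proposes a high-level, full-system simulation platform for embedded GPUs based on hybrid modeling. To speed up the simulation of complex system software, a microprocessor instruction-set simulator based on the QEMU virtual machine is proposed, and the on-chip interconnect of the SoC is modeled with SystemC transaction-level models, which improves simulation efficiency. A baseline embedded GPU architecture with multiple unified shaders and on-chip data buffers is then proposed, and its microarchitectural details are described with a cycle-level model. Finally, the cycle-level model is integrated with the SystemC transaction-level hardware model to provide the experimental platform for the rest of this work.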
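     Second, this dissertation proposes an area-optimized floating-point functional-unit datapath for the programmable shader. Based on the characteristics of floating-point vector operations, a unified multi-function floating-point vector arithmetic unit is first proposed. The key hardware modules of an existing vector inner product unit are vectorized and reused so that the unit supports the basic vector instructions, preserving computing performance while keeping logic area as small as possible. On this basis, idle vector arithmetic units inside the shader are reused to compute quadratic polynomial approximations of scalar transcendental functions, further reducing the logic cost of the floating-point scalar special function unit.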
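     Third, taking the transport triggered architecture (TTA) as a basis, this dissertation optimizes the single-shader architecture in terms of both performance and area. Fine-grained data transports and architecturally visible data bypassing in TTA reduce redundant write-backs of result data during shading instruction execution, which exposes instruction-level parallelism inside the shader and reduces the complexity of the interconnect in its datapath. Using a vertex shader as an example, the microarchitecture of a TTA-based programmable shader is then designed in detail. A shader micro-instruction set is customized by combining the characteristics of transport triggering and vertex processing, and area is further reduced by configuring the number of functional units and by improving the register ports and write-back mechanism. Finally, the shader is implemented in hardware and on an FPGA prototype system, verifying that the proposed architecture delivers high computing performance at reduced area cost and thus improves the area efficiency of the shader.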
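     Finally, this dissertation proposes a primitive-oriented vertex fetch strategy that eliminates the sequential dependencies among vertex data tasks running on multiple shaders. On this basis, the vertex cache structure originally designed for a single vertex shader is improved to optimize vertex data access bandwidth in a multi-shader architecture. Before vertex shading, a Pre-TnL vertex cache combined with the primitive-oriented fetch strategy caches recently fetched vertex data and lowers the frequency of bus accesses. A Post-TnL vertex cache with separate tag and data storage is then designed to cache the vertex processing results most recently submitted by the shaders. In-order submission control logic in the task scheduler of the multi-core embedded GPU guarantees the correctness of the results held in the separated cache. Simulation results show that the separated Post-TnL vertex cache reduces the number of vertices that are processed repeatedly and further lowers vertex access bandwidth.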
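     Simulation and hardware implementation results show that the proposed design methods for embedded GPU programmable shader architecture optimize both area overhead and vertex data access bandwidth, a useful exploration toward the design and implementation of future embedded GPU architectures based on multiple unified shaders.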
With the development of silicon technology and growing application requirements, embedded graphics processing units (GPUs) with multiple unified shaders have been widely integrated into Systems-on-Chip (SoCs) for high-end mobile devices. However, the number of programmable shader cores in an embedded GPU is restricted by silicon area cost, so shader architecture design must improve performance while maintaining area efficiency. Moreover, a large amount of graphics data located in external memory must be accessed during rendering, leading to high bus bandwidth and considerable power dissipation in embedded GPUs. It is therefore essential to optimize area cost and data bandwidth in programmable shader architecture. This dissertation addresses both problems through a modeling method for multi-core embedded GPU architecture, an area-efficient arithmetic datapath and processor architecture for shaders, and a bandwidth-optimized vertex cache hierarchy for multi-shader architecture. The main goal is to provide a theoretical and technical foundation for future research and design of multi-core embedded GPU architecture.
     First, a high-level, full-system simulation platform for embedded GPUs based on hybrid modeling methods is proposed. To avoid the slow simulation of complex system software, an instruction-set simulator based on QEMU is proposed. Additionally, the interconnection network and device interfaces in the SoC are modeled in SystemC-TLM to improve simulation efficiency. We then introduce a basic embedded GPU architecture based on multiple unified shaders and internal data buffers. To describe its microarchitecture, a detailed cycle-level model is proposed and combined with the SystemC-TLM hardware model to provide the fundamental experimental platform for our research.
     Second, area-efficient floating-point (FP) function units for the shader are proposed. First, a unified, multi-functional FP vector arithmetic unit (VAU) is implemented. To support basic vector operations, the main hardware blocks of a conventional vector inner product unit are vectorized and multiplexed, which maintains performance while avoiding large additional area cost. Based on the VAU, we introduce a method that uses idle VAUs in the shader to calculate quadratic approximations, which further reduces the area cost of the elementary transcendental function unit.
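
The idea of evaluating a quadratic approximation on a vector unit can be illustrated with a minimal sketch. It assumes a table-driven piecewise quadratic for 2**x on [0, 1), with Taylor coefficients taken at each segment midpoint, and evaluates c0 + c1*d + c2*d*d as a single 4-component inner product. The function choice, the table size SEGMENTS, the coefficient method, and the names dot4 and exp2_frac are all illustrative assumptions; none of them is taken from the dissertation.

```python
import math

SEGMENTS = 64  # hypothetical table size, not a value from the dissertation


def dot4(a, b):
    """4-component inner product, standing in for one pass through a vector unit."""
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3]


def build_table():
    """Per-segment coefficients (c0, c1, c2, 0) for 2**x, expanded at the midpoint."""
    ln2 = math.log(2.0)
    table = []
    for i in range(SEGMENTS):
        m = (i + 0.5) / SEGMENTS          # segment midpoint
        f = 2.0 ** m                      # f(m)
        table.append((f, ln2 * f, 0.5 * ln2 * ln2 * f, 0.0))
    return table


TABLE = build_table()


def exp2_frac(x):
    """Approximate 2**x for 0 <= x < 1 with one table lookup and one dot4."""
    i = min(int(x * SEGMENTS), SEGMENTS - 1)
    d = x - (i + 0.5) / SEGMENTS          # offset from the segment midpoint
    return dot4(TABLE[i], (1.0, d, d * d, 0.0))


def max_abs_error(samples=10000):
    """Largest absolute error over evenly spaced sample points in [0, 1)."""
    return max(abs(exp2_frac(k / samples) - 2.0 ** (k / samples))
               for k in range(samples))
```

The fourth lane is padded with zero here only to match a 4-wide inner product; a hardware design would more likely use minimax coefficients and fixed-point table entries, which this sketch does not model.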
     Third, a high-performance, area-efficient programmable shader architecture based on the transport triggered architecture (TTA) is proposed. With fine-grained data transport and architecturally visible bypassing, redundant write-back of instruction results can be avoided, which benefits the exploitation of instruction-level parallelism. A detailed TTA-like vertex shader microarchitecture is then implemented. Combining the features of TTA and vertex processing, we define a customized shading instruction set. By configuring the number of functional units and optimizing the design of the register ports and the result write-back scheme, the area cost of the implemented vertex shader is further reduced. Finally, we implement the proposed vertex shader in both an ASIC design and an FPGA prototype platform to verify that the proposed TTA-like shader architecture provides high performance at reduced area cost, leading to significant area efficiency for embedded platforms.
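
How transport triggering and bypassing avoid register write-backs can be shown with a toy model. The sketch below assumes a machine whose only instruction is a move; each functional unit has an operand port "o", a trigger port "t" that fires the operation when written, and a result port "r". The class names, port names, and the two programs are hypothetical and do not reflect the instruction set defined in the dissertation.

```python
class FunctionUnit:
    """Functional unit with an operand port, a trigger port, and a result port."""

    def __init__(self, op):
        self.op = op
        self.operand = 0.0
        self.result = 0.0

    def write(self, port, value):
        if port == "o":                   # operand port: just latch the value
            self.operand = value
        elif port == "t":                 # trigger port: latch and fire the operation
            self.result = self.op(self.operand, value)
        else:
            raise ValueError("unknown port: " + port)


class TTAMachine:
    """Toy transport-triggered machine that counts register-file writes."""

    def __init__(self, regs):
        self.regs = dict(regs)
        self.units = {
            "mul": FunctionUnit(lambda a, b: a * b),
            "add": FunctionUnit(lambda a, b: a + b),
        }
        self.rf_writes = 0

    def read(self, src):
        if "." in src:                    # "unit.r": read a result port directly
            unit, port = src.split(".")
            if port != "r":
                raise ValueError("only result ports are readable")
            return self.units[unit].result
        return self.regs[src]

    def move(self, src, dst):
        value = self.read(src)
        if "." in dst:
            unit, port = dst.split(".")
            self.units[unit].write(port, value)
        else:
            self.regs[dst] = value
            self.rf_writes += 1

    def run(self, program):
        for src, dst in program:
            self.move(src, dst)


# r3 = r0 * r1 + r2, with the product forwarded straight to the adder.
BYPASSED = [("r0", "mul.o"), ("r1", "mul.t"), ("mul.r", "add.o"),
            ("r2", "add.t"), ("add.r", "r3")]

# Same computation, but the product is first written back to a temporary r4.
SPILLED = [("r0", "mul.o"), ("r1", "mul.t"), ("mul.r", "r4"), ("r4", "add.o"),
           ("r2", "add.t"), ("add.r", "r3")]
```

Running both programs on the same inputs gives the same r3, but the bypassed program performs one register-file write where the spilled one performs two. The model ignores timing, buses, and scheduling, so it illustrates only the write-back saving, not performance.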
     Finally, we introduce a primitive-oriented vertex fetch (POVF) scheme to eliminate sequential dependencies among different vertex batches in the multi-shader architecture. Building on it, we reduce vertex data fetching bandwidth by optimizing the vertex cache hierarchy for multi-shader architecture. To reduce bus access frequency for vertex data, a pre-TnL vertex cache combined with the POVF scheme is proposed to hold recently fetched vertex data before shading. In addition, a post-TnL vertex cache with separated tag and SRAM is implemented to buffer recently shaded vertex results at different stages of vertex processing. To guarantee valid vertex cache results, hardware logic for in-order submission of vertex batches is implemented in the task scheduler of the multi-shader embedded GPU architecture. Simulation results show that the separated post-TnL vertex cache reduces both redundant vertex processing and vertex bandwidth.
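
The effect of a post-TnL vertex cache on redundant vertex processing can be estimated with a small simulation. The sketch below assumes a plain FIFO cache keyed by vertex index and an indexed triangle list for a regular quad grid. It is a generic single-stream model: it does not reproduce the separated tag and data storage, the POVF scheme, or the in-order submission logic described above, and the names count_shaded_vertices and grid_indices are hypothetical.

```python
from collections import deque


def count_shaded_vertices(indices, cache_size):
    """Count vertex-shader invocations for an index stream with a FIFO post-TnL cache."""
    fifo = deque()                        # insertion order, oldest entry on the left
    resident = set()                      # indices currently held in the cache
    shaded = 0
    for idx in indices:
        if idx in resident:
            continue                      # hit: reuse the already shaded result
        shaded += 1                       # miss: the vertex must be shaded
        if cache_size == 0:
            continue                      # no cache configured
        if len(fifo) == cache_size:
            resident.discard(fifo.popleft())
        fifo.append(idx)
        resident.add(idx)
    return shaded


def grid_indices(cols, rows):
    """Triangle-list indices for a cols x rows grid of quads, in row-major order."""
    out = []
    for y in range(rows):
        for x in range(cols):
            v0 = y * (cols + 1) + x       # top-left corner of the quad
            v1 = v0 + 1                   # top-right
            v2 = v0 + cols + 1            # bottom-left
            v3 = v2 + 1                   # bottom-right
            out += [v0, v1, v2, v1, v3, v2]
    return out
```

For a 4 x 4 quad grid the index stream has 96 entries that reference 25 distinct vertices, so the count ranges from 96 with no cache down to 25 once the cache can hold every vertex. Intermediate sizes depend on the traversal order of the mesh.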
     Simulation and implementation results show that area cost and vertex fetching bandwidth can be effectively optimized using the microarchitecture design methods proposed in this dissertation, which is a useful exploration for the future research and design of embedded GPU architecture based on multiple unified shaders.
