Research on the Design and Implementation Techniques of OpenGL ES Based on SDTA

Original title (Chinese): 基于SDTA结构的OpenGL ES关键技术实现与研究
Author: Gan Xinbiao (甘新标)
Thesis level: Master's
Discipline: Computer Science and Technology
Keywords: OpenGL ES; Synchronized Data Triggering Architecture (SDTA); Table-driven Cordic; Built-in Functions; Low-level Component; Optimization
Degree year: 2008
Advisor: Dai Kui (戴葵)
Discipline code: 081201
Degree-granting institution: National University of Defense Technology (国防科学技术大学)
Submission date: 2008-11-01

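Abstract

OpenGL ES (OpenGL for Embedded Systems) is a programming interface for 3D graphics rendering on embedded systems. OpenGL ES ports well across traditional Operation Triggering Architectures (OTA) and their platforms, but the OpenGL ES library has different implementations for specific architectures and runtime platforms, such as Vincent OpenGL ES and J2ME-based OpenGL ES.
     The Synchronized Data Triggering Architecture (SDTA) differs fundamentally from the traditional OTA: it is a high-performance data-parallel architecture developed as an improvement on the Transport Triggering Architecture (TTA). Studying OpenGL ES for SDTA and implementing its key techniques is therefore of significant value. The main results of this thesis are:
     1. Table-driven Cordic algorithm (T-Cordic)
     T-Cordic is in essence an improved Cordic algorithm aimed at data-intensive computation. It converts the iterations of the classic Cordic algorithm into simple 2×2 matrix multiplications. The coefficient matrix elements are precomputed and stored in tables, so each element can be obtained quickly by table lookup. Implemented on the SDTA instruction set architecture, T-Cordic shows a significant performance improvement.
     2. Optimization of OpenGL ES built-in functions on SDTA
     The built-in functions are the foundation and core of the OpenGL ES library. Instruction optimization techniques based on the SDTA instruction set architecture (loop unrolling, strength reduction, instruction merging, and sub-word parallelism) are the key to improving the performance of the built-in functions.
     3. Design and optimization of OpenGL ES low-level components
     This thesis proposes a multi-level structural design model for OpenGL ES and, using a function-dependency strategy, extracts a core subset of the OpenGL ES implementation, namely the OpenGL ES low-level components. The sub-word parallel primitive programming model based on the SDTA instruction set architecture is the key technique for optimizing the implementation of these components.
     4. A Unified Shader functional unit customized for SDTA
     A programmable shader is a prerequisite for programmability in OpenGL ES. Building on a study of separate Vertex Shader and Fragment Shader functional unit structures, this thesis proposes a Unified Shader unit design for SDTA. It unifies the Vertex Shader and Fragment Shader processing units into a single processing unit that can execute both vertex shaders and fragment shaders.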
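
The T-Cordic idea in item 1 can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the table layout (one entry per quantized angle over [0, π/2]), the iteration count, and all names (`cordic_rotate`, `tcordic_rotate`, `TABLE`) are assumptions chosen for illustration. The table is filled once with classic iterative Cordic; afterwards a rotation costs one lookup and one 2×2 matrix multiplication.

```python
import math

N_ITER = 16      # Cordic iterations used when filling the table (assumed)
TABLE_BITS = 8   # 2**8 angle steps across [0, pi/2] (assumed resolution)

# Elementary rotation angles atan(2^-i) and the accumulated Cordic gain.
ATAN = [math.atan(2.0 ** -i) for i in range(N_ITER)]
GAIN = 1.0
for i in range(N_ITER):
    GAIN *= math.sqrt(1.0 + 2.0 ** (-2 * i))


def cordic_rotate(x, y, theta):
    """Classic iterative Cordic: rotate (x, y) by theta (radians)."""
    z = theta
    for i in range(N_ITER):
        d = 1.0 if z >= 0 else -1.0          # rotation direction for this step
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * ATAN[i]
    return x / GAIN, y / GAIN                # remove the Cordic gain


# Precompute one 2x2 coefficient matrix (row-major) per quantized angle.
STEP = (math.pi / 2) / (1 << TABLE_BITS)
TABLE = []
for k in range((1 << TABLE_BITS) + 1):
    c, s = cordic_rotate(1.0, 0.0, k * STEP)
    TABLE.append((c, -s, s, c))


def tcordic_rotate(x, y, theta):
    """Table-driven rotation: one lookup plus one 2x2 matrix multiply."""
    k = round(theta / STEP)                  # nearest quantized angle
    if not 0 <= k < len(TABLE):
        raise ValueError("theta must lie in [0, pi/2]")
    m00, m01, m10, m11 = TABLE[k]
    return m00 * x + m01 * y, m10 * x + m11 * y
```

In this sketch the accuracy is bounded by the angle quantization (half a table step) rather than by the iteration count, so the table size is the knob that trades memory for precision.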

English Abstract

To smooth over platform differences, the Khronos Group defined OpenGL ES (OpenGL for Embedded Systems), an international standard for embedded graphics programming. Although OpenGL ES is portable, under some conditions, across traditional and popular Operation Triggering Architectures (OTA), the OpenGL ES library has different implementations, such as Vincent OpenGL ES and J2ME-based OpenGL ES, for specific architectures and systems.
     The Synchronized Data Triggering Architecture (SDTA), derived from the Transport Triggering Architecture (TTA), is a high-performance data-parallel architecture that differs from OTA. SDTA-oriented research on OpenGL ES is therefore both urgent and important for implementing OpenGL ES on the SDTA instruction set architecture. The contributions of this thesis can be summarized as follows.
     1. Table-driven Cordic (T-Cordic)
     T-Cordic transforms the iterations of classic Cordic into simple 2×2 matrix multiplications whose coefficient matrix elements are obtained by table lookup. It is designed for data-intensive instruction set architectures and is shown to achieve a significant performance gain on SDTA.
     2. Optimizing OpenGL ES built-in functions on SDTA
     The built-in functions are the kernel of the OpenGL ES library. Optimizing and scheduling the instructions that implement them on SDTA is particularly important. The optimization techniques include loop unrolling, strength reduction, instruction combining, and sub-word parallelism.
     3. Design and optimization of low-level components for OpenGL ES
     A multi-level structure model for OpenGL ES is proposed, and entity dependency for functions and data structures is defined. The core subset of OpenGL ES extracted according to the entity dependency statistics is called the low-level components of OpenGL ES. Furthermore, the sub-word parallel programming model for the SDTA instruction set architecture is a key technique for implementing and optimizing these components.
     4. SDTA-oriented unified shader unit
     A programmable shader unit is a prerequisite for programmability in OpenGL ES. Based on a study of traditional vertex and fragment shaders, a unified shader is proposed that unifies the structure and function of the separate vertex shader and fragment shader.
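
Of the optimization techniques listed in item 2, sub-word parallelism is the easiest to show in isolation. The sketch below emulates it in plain Python: four unsigned 8-bit lanes are packed into one 32-bit word and added with a single word-wide addition, with masks keeping carries from crossing lane boundaries. The lane width, the helper names (`pack_u8x4`, `unpack_u8x4`, `add_u8x4`), and the wraparound semantics are assumptions for illustration; the actual SDTA sub-word instructions are not described in this abstract.

```python
LOW7 = 0x7F7F7F7F    # low 7 bits of every 8-bit lane
HIGH1 = 0x80808080   # top bit of every 8-bit lane


def pack_u8x4(lanes):
    """Pack four values in 0..255 into one 32-bit word, lane 0 lowest."""
    if len(lanes) != 4:
        raise ValueError("exactly four lanes are required")
    word = 0
    for i, value in enumerate(lanes):
        if not 0 <= value <= 0xFF:
            raise ValueError("lane value out of 8-bit range")
        word |= value << (8 * i)
    return word


def unpack_u8x4(word):
    """Split a 32-bit word back into its four 8-bit lanes."""
    return [(word >> (8 * i)) & 0xFF for i in range(4)]


def add_u8x4(a, b):
    """Add four 8-bit lanes at once, each lane wrapping modulo 256."""
    # Adding only the low 7 bits cannot carry out of a lane.
    low = (a & LOW7) + (b & LOW7)
    # Restore each lane's top bit: a7 XOR b7 XOR carry-in from the low bits.
    return low ^ ((a ^ b) & HIGH1)
```

For example, `unpack_u8x4(add_u8x4(pack_u8x4([1, 200, 255, 16]), pack_u8x4([2, 100, 1, 240])))` yields `[3, 44, 0, 0]`: every lane wraps independently, which is the behavior a hardware sub-word add would provide in a single instruction.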

