Research on the Architecture of High-Precision Algorithm Accelerators Based on a VLIW Template
Abstract
Scientific computing has become the third pillar of modern scientific research, alongside theoretical study and physical experiment, and the accuracy of its results directly determines the success or failure of that research. As the scale of computation grows, the rounding errors of floating-point operations accumulate ever faster, yielding results that are inaccurate, unreliable, or even wrong. High-precision arithmetic is the most direct, effective, and reliable way to guarantee the accuracy of large-scale scientific computation; it also improves the reproducibility of algorithms, strengthens their numerical stability, and accelerates their convergence. General-purpose platforms built on CPUs or GPUs, however, provide data paths of fixed width and arithmetic units of fixed precision, so the various high-precision floating-point formats can only be emulated in software, which leads to poor performance and efficiency.
     In recent years, FPGA devices have become an attractive acceleration platform thanks to their customizability, reconfigurability, high performance, and low power consumption. This thesis combines FPGA reconfigurable technology and Very Long Instruction Word (VLIW) technology with high-precision computation, explores the key problems in designing FPGA-based high-precision algorithm accelerators, exploits the different levels of parallelism in high-precision applications, and maximizes the performance and resource utilization of the FPGA. The main contributions of this thesis are as follows:
     1. A processor architecture tailored to high-precision arithmetic: the custom VLIW template. VLIW technology is an ideal way to exploit the parallelism of an algorithm, offering simple hardware, high performance, and good scalability. Targeting the characteristics of high-precision arithmetic, this thesis customizes a VLIW template on the FPGA platform that integrates multiple custom high-precision basic operation units and exploits the instruction-level parallelism of high-precision computation through the explicit parallelism of VLIW instructions. On top of this template, a configurable multi-core VLIW accelerator architecture is built to exploit the thread-level parallelism of high-precision applications. Finally, to tackle code bloat, the key weakness of VLIW designs, a multi-level indexed VLIW instruction compression technique suited to FPGAs is proposed; it uses flag bits and multiple memory banks to resolve the variable instruction length problem of traditional code compression and to avoid, as far as possible, the instruction space wasted by no-ops. In the designs of the VLIW-template-based quadruple-precision elementary function processor and the quadruple-precision algorithm accelerator, this compression strategy achieves compression ratios of 37.5% and 24.5%, respectively.
     2. A fully unrolled exact quadruple-precision vector inner product algorithm and implementation (Quad-HPMAC). For the vector inner product, the basic operation most common in scientific computing and most influential on the stability and accuracy of numerical algorithms, this thesis proposes a fully unrolled exact quadruple-precision inner product algorithm and structure (Quad-HPMAC). Lossless fixed-point operations are used to obtain the exact inner product, and optimizations such as a two-level storage structure for the accumulated sum, partitioning of the accumulated sum, and carry-save accumulation raise the frequency and throughput of the Quad-HPMAC unit. Finally, a unified quadruple-precision matrix accelerator built from Quad-HPMAC modules implements matrix multiplication, LU decomposition, and MGS-QR decomposition. Experimental results show that, compared with parallel software on a general-purpose Intel multi-core platform, the accelerator gains 5 to 8 digits of additional accuracy and more than a 40x improvement in performance.
     3. A unified quadruple-precision elementary function computation model and implementation based on the VLIW template (QP_VELP). Elementary functions in scientific computing are numerous, complex to implement, infrequently used, and long in latency; to address this, the thesis proposes a unified quadruple-precision elementary function computation model and structure (QP_VELP) based on the VLIW template. The structure offers high performance and good scalability: the Estrin scheme increases the parallelism of polynomial evaluation, while loop unrolling, pipelining, and the explicit parallelism of VLIW instructions raise performance further. Compared with related work, the unified elementary function processor is superior in resource consumption, latency, and accuracy, and it evaluates many different elementary functions with the same hardware, achieving high resource utilization in real scientific and engineering applications.
     4. A quadruple-precision algorithm accelerator architecture based on the VLIW template. For irregular compute-intensive algorithms in scientific computing, taking the SGP4/SDP4 orbit prediction algorithm for space objects as an example, this thesis proposes a VLIW-template-based quadruple-precision accelerator. Integrating the QP_VELP module provides the many infrequently used elementary functions; constraints encoded in the custom VLIW instructions satisfy the complex data dependences between operations; multiple quadruple-precision operation units executing in parallel exploit the instruction-level parallelism of the algorithm; and multiple VLIW cores executing in parallel exploit its thread-level parallelism. The thesis also proposes a greedy instruction scheduling algorithm that, combined with storage allocation and conflict detection, maps the data flow graph of the algorithm onto the custom VLIW instruction slots and minimizes the no-ops in the generated instructions. Experimental results show that the quadruple-precision accelerator achieves a 7.8x to 15x speedup over an Intel multi-core processor.
     5. Extension to arbitrary-precision floating-point arithmetic. For scientific domains that demand even higher precision, the concepts, research approach, and implementation methods of the quadruple-precision accelerator are extended to an arbitrary-precision floating-point arithmetic system. The thesis proposes a fully unrolled exact arbitrary-precision vector inner product algorithm and structure (VPMAC) and a VLIW-template-based arbitrary-precision elementary function processor (VP_VELP). VP_VELP integrates multiple arbitrary-precision basic operation units, improves performance through the explicit parallelism of VLIW instructions and by dynamically varying the internal computation precision, and implements many arbitrary-precision basic operations and elementary functions with the same hardware. Finally, arbitrary-precision matrix algorithms are realized in two ways: a VPMAC coprocessor and a unified arbitrary-precision matrix accelerator (VPMATA). Experimental results show that, compared with the parallel MPFR library on an Intel quad-core processor, a VPMATA configuration with 8 VPMAC modules and 1 VP_VELP module achieves a 13x to 63x speedup.
Scientific computing has become the third mode of scientific discovery beyond theory and experiment. Most scientific codes rely on floating-point arithmetic, in which rounding error is an unavoidable consequence, and the accumulation of rounding errors leads to inaccurate, unreliable, and even wrong results. Many scientific applications therefore rely on high-precision arithmetic. However, the performance of high-precision arithmetic on general-purpose processors is very poor, since it is mostly accomplished by software emulation on top of fixed-precision operations such as 64-bit floating point.
     Field-Programmable Gate Arrays (FPGAs) have advantages over CPUs in terms of customizability, reconfigurability, performance, and power consumption, so FPGA-based accelerators have become a promising approach to speeding up scientific applications. In this thesis, we implement high-precision floating-point arithmetic on FPGAs to explore the capability and flexibility of FPGA solutions for accelerating high-precision scientific applications. In summary, this thesis makes the following contributions:
     (1) We propose a parameterizable Very Long Instruction Word (VLIW) framework on FPGAs, which features low hardware complexity, high performance, and high scalability. Based on this framework, a hardware accelerator with multiple VLIW cores is presented to exploit instruction-level parallelism (ILP) and thread-level parallelism (TLP) in high-precision applications simultaneously. To solve the code density problem of VLIW implementations, we propose a multi-level index code compression scheme for the custom VLIW framework on FPGAs. For each unit, a flag bit indicates whether the unit is used, and a RAM bank stores only the operations actually issued. This scheme resolves the uncertain VLIW instruction length of traditional code compression methods and avoids storing explicit no-ops entirely.
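     As a rough software analogue of the flag-plus-bank idea described above (not the thesis's actual encoding; all names below are hypothetical), the following Python sketch keeps one flag bit per issue slot and stores only the non-empty operations in per-slot banks, so empty slots cost a single bit. A real multi-level index scheme would add an index table so instructions can also be decompressed after a branch; this sketch assumes purely sequential fetch.

NOP = None  # an empty issue slot

def compress(vliw_program, num_slots):
    """Per instruction, keep one flag bit per slot; store only the
    non-NOP operations, slot by slot, in separate banks."""
    flags = []                               # one flag vector per instruction
    banks = [[] for _ in range(num_slots)]   # one operation bank per slot
    for instr in vliw_program:               # instr is a tuple of num_slots ops
        flags.append([op is not NOP for op in instr])
        for slot, op in enumerate(instr):
            if op is not NOP:
                banks[slot].append(op)       # bank position is implicit (in issue order)
    return flags, banks                      # no explicit no-ops are stored

def fetch(flags, banks, pc, bank_ptrs):
    """Rebuild instruction `pc`: read a bank only where the flag is set.
    bank_ptrs[slot] counts how many ops of that slot were consumed so far."""
    instr = []
    for slot, used in enumerate(flags[pc]):
        if used:
            instr.append(banks[slot][bank_ptrs[slot]])
            bank_ptrs[slot] += 1
        else:
            instr.append(NOP)
    return instr

prog = [("qadd", NOP, "qmul", NOP),
        (NOP,  "ld",  NOP,   NOP),
        ("qadd", "st", "qmul", "qdiv")]
flags, banks = compress(prog, 4)
ptrs = [0, 0, 0, 0]
assert [fetch(flags, banks, pc, ptrs) for pc in range(3)] == [list(i) for i in prog]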
     (2) We propose an exact vector inner product algorithm and structure (Quad-HPMAC) for IEEE 754-2008 quadruple-precision floating-point arithmetic. A very long fixed-point register stores the running sum without information loss, and exact fixed-point operations, instead of floating-point operations, are used to obtain exact results. Several schemes, such as a two-level RAM bank structure for the sum, a partial summation scheme, and a carry-save accumulation scheme, are introduced to improve the frequency and throughput of the Quad-HPMAC unit. Finally, a prototype of the unified matrix accelerator, equipped with 4 Quad-HPMAC units, implements typical quadruple-precision matrix algorithms such as matrix multiplication, LU decomposition, and MGS-QR decomposition. Experimental results show that our design outperforms general-purpose processors in terms of precision, performance, and power consumption.
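     The principle behind the exact accumulation can be mimicked in software: form each product exactly, accumulate without rounding, and round only once when converting back to floating point. The minimal Python sketch below illustrates that idea at binary64 (the thesis works at binary128 with a wide fixed-point register in hardware); it is an illustration of the principle, not the Quad-HPMAC design.

from fractions import Fraction

def exact_dot(xs, ys):
    """Exact inner product of two float vectors: every product and every
    addition is carried out exactly; rounding happens once, at the end."""
    acc = Fraction(0)                      # plays the role of the long fixed-point register
    for x, y in zip(xs, ys):
        acc += Fraction(x) * Fraction(y)   # binary floats convert to Fraction exactly
    return float(acc)                      # single rounding to the target format

def naive_dot(xs, ys):
    s = 0.0
    for x, y in zip(xs, ys):
        s += x * y                         # rounding in every step accumulates error
    return s

# An ill-conditioned example where the difference shows up:
xs = [1e16, 1.0, -1e16]
ys = [1.0, 1.0, 1.0]
print(naive_dot(xs, ys))   # 0.0  (the 1.0 is swallowed by the large terms)
print(exact_dot(xs, ys))   # 1.0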
     (3) We propose a special-purpose processor (QP_VELP) based on the custom VLIW framework, which uses unified hardware to efficiently evaluate various quadruple-precision elementary functions. The processor is well matched to the characteristics of elementary functions in scientific applications: high implementation complexity, low usage frequency, and long latency. A pipelined implementation of polynomial approximation with the Estrin scheme enhances ILP, and performance is further improved through loop unrolling and the explicit parallelism of VLIW instructions. Compared with related work, our design achieves higher precision and lower latency with less resource consumption; moreover, our unified solution for elementary functions achieves high resource utilization.
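     For reference, Estrin's scheme pairs adjacent polynomial terms so that the partial sums at each level are mutually independent and can be issued in parallel VLIW slots, unlike Horner's strictly sequential chain. The Python sketch below shows the scheme itself, not the QP_VELP datapath.

def estrin(coeffs, x):
    """Evaluate sum(coeffs[i] * x**i) with Estrin's scheme.
    Each pass combines adjacent pairs; the pairs are independent of one
    another, which is the ILP a VLIW core can exploit."""
    terms = list(coeffs)
    power = x                          # x, then x**2, x**4, ...
    while len(terms) > 1:
        paired = [terms[i] + terms[i + 1] * power
                  for i in range(0, len(terms) - 1, 2)]
        if len(terms) % 2:
            paired.append(terms[-1])   # odd leftover passes through unchanged
        terms = paired
        power *= power
    return terms[0]

def horner(coeffs, x):
    """Horner's rule: same result, but every step depends on the previous one."""
    acc = 0.0
    for c in reversed(coeffs):
        acc = acc * x + c
    return acc

coeffs = [1.0, 1.0, 0.5, 1/6, 1/24, 1/120, 1/720]   # truncated series for exp(x)
print(estrin(coeffs, 0.1), horner(coeffs, 0.1))     # same value, different dependence chains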
     (4) Taking the orbit prediction algorithm for space objects (SGP4/SDP4) as an example, we present a VLIW-based architecture for quadruple-precision scientific applications. The QP_VELP unit is integrated into this accelerator to implement the various elementary functions in SGP4/SDP4 with unified hardware, and multiple basic quadruple-precision operation units execute in parallel to exploit the ILP and TLP in SGP4/SDP4. Meanwhile, we propose a greedy algorithm that schedules the operations of the SGP4/SDP4 data flow graph into the custom VLIW instructions and generates an instruction sequence with few no-ops. Experimental results show that our VLIW-based accelerator exhibits performance and power advantages over general-purpose processors.
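     The scheduling step can be pictured as classic greedy list scheduling of a data flow graph onto a fixed number of issue slots: each operation is placed in the earliest cycle where its operands are ready and a slot is free. The Python sketch below illustrates only that idea; the thesis's scheduler additionally handles storage allocation and conflict detection, and the example operations are hypothetical.

from collections import defaultdict

def greedy_vliw_schedule(ops, preds, latency, num_slots):
    """ops: operation ids in topological order of the data flow graph;
    preds[op]: operations whose results op consumes;
    latency[op]: cycles until op's result is ready;
    num_slots: issue width of the custom VLIW instruction."""
    ready = {}                        # op -> cycle when its result is available
    used = defaultdict(int)           # cycle -> slots already occupied
    schedule = defaultdict(list)      # cycle -> ops issued in that cycle
    for op in ops:
        cycle = max((ready[p] for p in preds.get(op, ())), default=0)
        while used[cycle] >= num_slots:   # first later cycle with a free slot
            cycle += 1
        schedule[cycle].append(op)
        used[cycle] += 1
        ready[op] = cycle + latency[op]
    return schedule                   # unused slots in each cycle become no-ops

# Tiny example: d = (a*b) + (a+c), with two slots per instruction
preds   = {"mul": [], "sum": [], "add": ["mul", "sum"]}
latency = {"mul": 3, "sum": 1, "add": 1}
print(dict(greedy_vliw_schedule(["mul", "sum", "add"], preds, latency, 2)))
# {0: ['mul', 'sum'], 3: ['add']}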
     (5) We extend the concepts, research methods, and implementation schemes of the quadruple-precision algorithm accelerator to an arbitrary-precision arithmetic system. First, we present the exact vector inner product structure (VPMAC) for arbitrary-precision floating-point arithmetic, which uses exact fixed-point operations to avoid introducing rounding errors. Then, we present a processor (VP_VELP) based on the custom VLIW framework for arbitrary-precision elementary functions, whose performance is improved through the explicit parallelism of VLIW instructions and by dynamically varying the precision of intermediate computations. Finally, two schemes, the VPMAC coprocessor and the unified matrix accelerator (VPMATA), are presented to accelerate typical arbitrary-precision matrix algorithms. Experimental results show that a VPMATA equipped with 8 VPMAC units and 1 VP_VELP unit achieves 13x to 63x better performance.
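     As a software analogue of dynamically varying the precision of intermediate computation, the sketch below refines a reciprocal by Newton iteration while growing the working precision step by step, so the early, inexact iterations run cheaply at low precision. It uses Python's decimal module purely for illustration under that assumption; it is not the VP_VELP datapath.

from decimal import Decimal, localcontext

def recip(a: Decimal, digits: int) -> Decimal:
    """Reciprocal of a to `digits` significant digits via Newton iteration,
    with the working precision grown roughly geometrically per step."""
    x = Decimal(1 / float(a))          # ~16-digit seed from a hardware double
    prec = 16
    while prec < digits:
        prec = min(2 * prec, digits + 4)    # precision roughly doubles each step
        with localcontext() as ctx:
            ctx.prec = prec
            x = x * (2 - a * x)             # Newton step: quadratic convergence
    with localcontext() as ctx:
        ctx.prec = digits
        return +x                           # final rounding to the target precision

print(recip(Decimal(3), 50))           # ~50-digit approximation of 1/3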
