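Research and Implementation of Key Techniques for Floating-Point Unit Design

English title: Research and Implementation of Key Techniques of High Performance Floating-Point Unit Designs
Author: Chen Fangyuan (陈芳园)
Degree level: Master's
Discipline: Computer Science and Technology
Keywords: floating-point unit; binary detection method; Booth algorithm; tree multiplier; floating-point multiplier; floating-point divider
English keywords: FPU ; Binary-system Detection ; BOOTH Algorithm ; Tree Multipliers ; Floating-point Multiplier ; Floating-point Divider
Degree year: 2008
Advisor: Liu Yun (刘芸)
Discipline code: 081201
Degree-granting institution: National University of Defense Technology (国防科学技术大学)
Thesis submission date: 2007-11-01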

Abstract

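Over the past 60 years, rapid advances in microelectronics and integrated-circuit technology have driven remarkable progress in microprocessors, whose performance has improved quickly. To meet the demand for high-performance microprocessors, the floating-point unit (FPU), which lies on the critical path, must operate fast enough.
     The FPUs in existing processors generally achieve good performance, but some problems remain. Floating-point multiplication is moving toward algorithms with higher radix, wider operands, and greater parallelism, so the speed and area of the multiplier directly affect the performance of the whole FPU, and the multiplier design must be improved and optimized. Meanwhile, less frequently used operations such as division and square root remain the performance bottleneck of the unit: their structures are relatively complex, and they consume considerable area and power.
     To address these problems, this thesis studies key techniques for FPU design. For the partial-product generation rules of floating-point multiplication, it proposes a pseudo-1 transformation that optimizes the control path. In the traditional Wallace tree multiplier, it proposes a pseudo-plus (pre-pseudo-addition) method that reduces both the partial-product accumulation delay and the circuit complexity. Building on the multiplier, it designs and implements floating-point division by combining a lookup table with the Goldschmidt algorithm, and uses control logic to give the FPU in-order execution with out-of-order completion, improving the utilization of FPU resources. Using these techniques, the thesis designs and implements an FPU and analyzes and tests its performance, verifying the effectiveness and correctness of the proposed techniques.
     1. First, the thesis analyzes floating-point addition, a key component of the FPU. Based on the two-path algorithm structure, and targeting result normalization, which has a relatively large delay in floating-point addition, it applies the binary detection method of leading-zero detection to design a leading-zero detector that shortens the delay and simplifies the circuit.
     2. Second, for 64-bit multiplication, it optimizes the control path of the partial-product generation circuit and proposes a pseudo-1 transformation strategy for that control path, which lowers delay, simplifies the circuit, and reduces area and power.
     3. In addition, in the traditional Wallace tree multiplier, it introduces carry prefetching and low-order-bit discarding in the partial-product compression array and proposes the pseudo-plus method, which reduces both delay and circuit complexity. Combined with pipelining, the improved design completes a single- or double-precision floating-point multiplication in a single cycle, meeting the high performance demands of fast 3D graphics computation and high-speed FPUs.
     4. Building on the pipelined floating-point multiplier, it designs and implements floating-point division by combining the table lookup method with the Goldschmidt algorithm.
     5. Based on these key components, it implements control of the pipelines of the individual floating-point operations. Using the iteration control signals of floating-point division, it designs and implements in-order execution with out-of-order completion for the FPU, making full use of FPU resources and improving performance.
     Finally, building on this research, the thesis designs and implements a high-performance floating-point processor that realizes the proposed techniques. Testing and simulation show that the processor meets the requirements in both performance and area.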
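     The binary detection idea behind leading-zero detection (item 1 above) can be illustrated in software. The sketch below is a generic halving search, not the circuit designed in the thesis; the function name `count_leading_zeros` and the default 64-bit width are assumptions made for illustration only.

```python
def count_leading_zeros(x: int, width: int = 64) -> int:
    """Count leading zeros of a width-bit value by repeated halving.

    Hypothetical helper: `width` is assumed to be a power of two.
    """
    if width <= 0 or width & (width - 1):
        raise ValueError("width must be a positive power of two")
    if not 0 <= x < (1 << width):
        raise ValueError("x does not fit in width bits")
    if x == 0:
        return width

    mask = (1 << width) - 1
    count = 0
    span = width // 2
    while span > 0:
        # If the top `span` bits are all zero, the leading one lies lower:
        # record those zeros and shift the value up to discard them.
        if x >> (width - span) == 0:
            count += span
            x = (x << span) & mask
        span //= 2
    return count
```

     Each step tests half of the remaining window, so a 64-bit input needs six tests (32, 16, 8, 4, 2, 1) rather than a bit-by-bit scan. In hardware the same decomposition is usually built as a tree of small detectors, which this loop only approximates.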
Over the past 60 years, microelectronics and integrated-circuit technology have progressed rapidly. As a result, microprocessors have developed remarkably and their performance has improved quickly. To meet the demand for high performance, the floating-point unit (FPU), which lies on the critical path, must be fast.
     Existing FPUs generally achieve good performance, but some problems remain. Floating-point multiplication algorithms are moving toward higher radix, wider operands, and a higher degree of parallelism. The speed and area of the multiplier directly affect FPU performance, so the multiplier design needs to be improved and optimized. Division and square root are still the performance bottleneck of the FPU; their structure is more complex, and they require large area and power.
     To address these problems, this thesis studies key technologies of FPU design. For 64-bit multiplication, we propose a Pseudo-1 transformation strategy in the partial-product generation circuit, which optimizes the control path. In the traditional Wallace tree multiplier, we propose the pseudo-plus approach, which reduces both the delay and the complexity of the circuit. Building on the floating-point multiplier, the thesis implements floating-point division using the Goldschmidt algorithm and a lookup table. We implement an FPU with in-order execution and out-of-order completion; the design makes full use of FPU resources and improves performance. With these techniques, the thesis designs and implements a high-performance FPU, and analysis and testing confirm the effectiveness and correctness of the techniques.
     First, the thesis analyzes the floating-point adder, a key component. Based on the two-path structure, it studies the normalization step of the floating-point adder and proposes a solution using a leading-zero detection algorithm. The design shortens the delay and simplifies the circuit.
     Second, for 64-bit multiplication, we optimize the control path in the partial-product generation circuit and propose the Pseudo-1 transformation strategy to reduce delay, simplify the circuit, and reduce area and power consumption.
     Third, in the traditional Wallace tree multiplier, we introduce carry prefetching and low-order-bit discarding in the compression array. The thesis proposes the pseudo-plus approach, which reduces both the delay and the complexity of the circuit. Combined with pipelining, the design completes a single- or double-precision floating-point multiplication in a single cycle. It meets the performance requirements of fast 3-D graphics computation and high-performance FPUs.
     Fourth, building on the floating-point multiplier, the thesis implements floating-point division using the Goldschmidt algorithm and a lookup table.
     Fifth, based on these key components, we implement an FPU with in-order execution and out-of-order completion. The design makes full use of FPU resources and improves performance.
     Finally, building on these key techniques, the thesis designs and implements a high-performance FPU that realizes them. Testing and simulation show that the FPU satisfies the requirements for performance, power, and area.
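     The combination of a lookup table with Goldschmidt iteration named in the abstract can be sketched numerically. This is a minimal illustration under stated assumptions, not the thesis's datapath: Python floats stand in for fixed-point mantissas, and the names `TABLE_BITS`, `RECIP_TABLE`, and `goldschmidt_divide`, the 8-bit table index, and the three iterations are all hypothetical choices.

```python
TABLE_BITS = 8  # assumed index width of the reciprocal seed table

# Seed table for divisors in [1, 2): one reciprocal per interval midpoint.
RECIP_TABLE = [
    1.0 / (1.0 + (i + 0.5) / (1 << TABLE_BITS))
    for i in range(1 << TABLE_BITS)
]


def goldschmidt_divide(n: float, d: float, iterations: int = 3) -> float:
    """Approximate n / d for a normalized divisor 1 <= d < 2."""
    if not 1.0 <= d < 2.0:
        raise ValueError("divisor must be normalized to [1, 2)")

    # Table lookup: the leading fraction bits of d select the seed.
    index = int((d - 1.0) * (1 << TABLE_BITS))
    seed = RECIP_TABLE[index]

    # Scale both operands so the denominator starts close to 1.
    num = n * seed
    den = d * seed

    # Each step multiplies both by (2 - den), driving den toward 1
    # and num toward the quotient; the error roughly squares per step.
    for _ in range(iterations):
        factor = 2.0 - den
        num *= factor
        den *= factor
    return num
```

     Every step uses only multiplications and one subtraction from 2, which is why this style of divider can reuse an existing multiplier. A better seed (larger table) trades table area for fewer iterations.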

