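Design Optimization of a 64-bit High-Performance Floating-Point Multiplier

English title: Optimization of A 64-bit High Performance Float-point Multiplier
Author: Li Xiaojing (李晓静)
Degree level: Master's
Discipline: Software Engineering
Keywords: floating-point multiplier; speed optimization; semi-custom; full-custom; LEF; LIB
Degree year: 2010
Advisor: Li Shaoqing (李少青)
Discipline code: 080903
Degree-granting institution: National University of Defense Technology (国防科学技术大学)
Thesis submission date: 2010-04-01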

Abstract

Floating-point multipliers are structurally complex and have long logic delays, which makes them one of the bottlenecks in high-performance microprocessor design. A faster implementation of floating-point multiplication is therefore important for improving processor performance.

A purely semi-custom implementation can no longer meet ever-higher clock-frequency targets. To reach the design goal while balancing performance against design effort, this thesis implements the two core modules, partial-product compression and partial-product accumulation, in full custom, and implements the multiplier as a whole with a semi-custom flow. This combination raises the speed of the floating-point multiplier effectively without adding much overhead.

The main results of this work are:

1. An improved 4-2 compressor structure is proposed and used in the compression stage of this design. Its delay is about 27.5% lower than that of the earlier structure.
2. The 4-2 compressor is designed in full custom. Its delay is 0.11 ns, compared with 0.18 ns for the semi-custom 4-2 compressor, a reduction of 39%.
3. Based on an analysis of how the group-adder width of a parallel adder relates to the delay of carry-tree generation, a 136-bit adder is designed in a fully parallel style, for both sum and carry, and implemented in full custom. Its delay is 0.30 ns, which reduces the total delay of the partial-product accumulation module by 21.3% relative to the semi-custom version.

Under typical (tt) conditions in a 65 nm CMOS process, the synthesized frequency of the optimized floating-point multiplier rises from 1.4 GHz for the semi-custom design to 1.8 GHz, an improvement of about 30%. Back-end physical design of the multiplier was also carried out; after place and route, the layout runs at 1.36 GHz.
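For readers unfamiliar with the 4-2 compressor named in result 1, the following is a minimal behavioral sketch in Python. It models the textbook structure built from two cascaded full adders, not the improved structure proposed in the thesis, which the abstract does not describe. The function names `full_adder`, `compressor_4_2`, and `check_all_inputs` are illustrative assumptions.

```python
from itertools import product


def full_adder(a, b, c):
    """Return (sum, carry) for three input bits."""
    s = a ^ b ^ c
    carry = (a & b) | (a & c) | (b & c)
    return s, carry


def compressor_4_2(x1, x2, x3, x4, cin):
    """Textbook 4-2 compressor (assumed structure, two cascaded full adders).

    Returns (sum, carry, cout). The sum bit has the weight of the inputs;
    carry and cout both have twice that weight.
    """
    # First full adder: cout depends only on x1..x3, never on cin,
    # so carries do not ripple along a row of compressors.
    s1, cout = full_adder(x1, x2, x3)
    # Second full adder folds in the fourth input and the incoming carry.
    total, carry = full_adder(s1, x4, cin)
    return total, carry, cout


def check_all_inputs():
    """Check x1+x2+x3+x4+cin == sum + 2*(carry + cout) for all 32 inputs."""
    for bits in product((0, 1), repeat=5):
        s, carry, cout = compressor_4_2(*bits)
        if sum(bits) != s + 2 * (carry + cout):
            return False
    return True
```

Calling `check_all_inputs()` exercises every input combination against the defining identity of a 4-2 compressor. A delay-optimized design such as the one the thesis reports would keep this truth table while restructuring the logic.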
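Result 3 refers to a 136-bit adder whose carries come from a parallel carry tree. The abstract does not say which tree topology the thesis uses, so the sketch below assumes a Kogge-Stone parallel-prefix scheme purely for illustration. Python integers stand in for bit vectors, the carry-in is assumed to be 0, and the name `kogge_stone_add` is hypothetical.

```python
def kogge_stone_add(a, b, width=136):
    """Add two width-bit integers using a Kogge-Stone prefix carry tree.

    Assumes carry-in is 0. Returns (sum modulo 2**width, carry_out).
    """
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    g = a & b            # per-bit generate
    p = a ^ b            # per-bit propagate
    half_sum = p         # sum bits before carries are applied
    dist = 1
    while dist < width:
        # Merge each group with the group ending dist bits below it.
        # The right-hand sides use the values from the previous level.
        g = (g | (p & (g << dist))) & mask
        p = (p & (p << dist)) & mask
        dist *= 2
    # Bit i of g is now the carry out of bits 0..i, so the carry into
    # bit i is bit i-1 of g.
    carries = (g << 1) & mask
    total = half_sum ^ carries
    carry_out = (g >> (width - 1)) & 1
    return total, carry_out
```

For a width of 136 the loop runs 8 times (distances 1 through 128), so every carry is available after a logarithmic number of levels instead of rippling through all 136 bits. How the thesis trades group width against carry-tree delay is not recoverable from the abstract.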

