Design and Verification of the 600 MHz YHFT-DX Multiply Unit
Abstract
YHFT-DX is a 32-bit high-performance fixed-point DSP with a very long instruction word (VLIW) architecture. Its CPU core contains two independent multiply units that are identical in function and structure, and both are pipelined, which gives YHFT-DX high multiplication performance. However, the multiply units must support a large number and variety of instructions, which makes their internal structure complex and makes the 600 MHz design target a challenge.
     Based on the design requirements of the YHFT-DX processor and on a mixed full-custom and semi-custom design methodology, this work studies the key factors that affect timing and area at the system, module, and circuit levels, and completes a multiply unit design that meets the 600 MHz target. The main contributions are as follows:
     1. Pipeline optimization. Following a detailed analysis of the function and pipeline structure of the multiply unit, its three pipelines are adjusted and optimized. Pipelines of the same type are optimized by merging logic between stages, unifying their processing, and moving logic forward to earlier stages. Pipelines of different types share inter-stage registers through time-division multiplexing, which saves hardware resources.
     2. Full-custom design of the critical modules. At the structural level, the critical modules are optimized by partitioning them into levels and pipeline stages, reducing the number of operand bits, and splitting, regrouping, or transforming logic. At the circuit level, in addition to commonly used circuit structures, a register with high drive capability is designed to reduce the number of logic levels. At the layout level, bit-slice design, source/drain sharing, and routing-channel reuse are applied to reduce long interconnects and parasitic parameters. Optimization at these three levels ensures that the timing of the full-custom modules meets the design requirements.
     3. Logic synthesis and physical design that include the full-custom modules. The floorplan, the power and ground plan, and the clock design of the multiply unit are completed according to the full-chip floorplan, and "useful skew" is introduced in the clock design to balance the paths with internal timing violations.
     The design is implemented in a 130 nm CMOS process and occupies an area of 400 × 430 μm². Verification results show that the design is functionally correct and that the delay of the longest path is 1.31 ns, a timing improvement of 37.5% over an implementation that uses the semi-custom methodology throughout, which meets the design target.
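The figures quoted in the abstract can be checked with a few lines of arithmetic, and the idea behind useful skew can be illustrated at the same time. The sketch below is not taken from the thesis. It assumes that the 37.5% improvement means a 37.5% reduction in critical-path delay relative to the semi-custom result, it ignores setup time, clock uncertainty, and hold constraints, and the function names and the example slack values are hypothetical.

```python
# Timing arithmetic for the figures quoted in the abstract, plus a toy
# model of useful skew. All function names are illustrative only.

TARGET_MHZ = 600.0        # design target stated in the abstract
CRITICAL_PATH_NS = 1.31   # reported longest-path delay
IMPROVEMENT = 0.375       # reported improvement (assumed: delay reduction)


def period_ns(freq_mhz: float) -> float:
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / freq_mhz


def setup_margin_ns(freq_mhz: float, delay_ns: float) -> float:
    """Period minus path delay (setup time and uncertainty ignored)."""
    return period_ns(freq_mhz) - delay_ns


def baseline_delay_ns(delay_ns: float, improvement: float) -> float:
    """Delay before the improvement, assuming delay_ns = baseline * (1 - improvement)."""
    return delay_ns / (1.0 - improvement)


def balance_with_useful_skew(slack_in_ns: float, slack_out_ns: float):
    """Balance the setup slack of two paths that meet at one register.

    Delaying that register's clock by `skew` adds `skew` to the slack of
    the path that ends at the register and removes the same amount from
    the path that starts there. Returns (skew, new_slack_in, new_slack_out).
    """
    skew = (slack_out_ns - slack_in_ns) / 2.0
    return skew, slack_in_ns + skew, slack_out_ns - skew


if __name__ == "__main__":
    print(f"period at target:    {period_ns(TARGET_MHZ):.3f} ns")
    print(f"margin at 1.31 ns:   {setup_margin_ns(TARGET_MHZ, CRITICAL_PATH_NS):.3f} ns")
    print(f"implied baseline:    {baseline_delay_ns(CRITICAL_PATH_NS, IMPROVEMENT):.3f} ns")
    # Hypothetical slacks: a violating path (-0.10 ns) into a register
    # followed by a path with spare time (+0.30 ns) out of it.
    print(balance_with_useful_skew(-0.10, 0.30))
```

Under these assumptions the 600 MHz period is about 1.667 ns, so a 1.31 ns path leaves roughly 0.357 ns of margin, and the implied semi-custom delay is about 2.096 ns, which would not fit within that period. The skew example shows why delaying one register clock can remove a violation without changing any logic: the spare time on the downstream path is lent to the upstream one.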