Design and Optimization of Heterogeneous Multi-Core Processing Chip

Chinese title: 异构多核处理芯片设计及优化
Author: Zhou Shuai (周帅)
Degree level: Master's
Discipline: Microelectronics and Solid-State Electronics
Keywords: Heterogeneous Multi-Core; Network on Chip; Network Interface; Sin/Cos Computing Unit; Matrix Transpose
Degree year: 2012
Advisor: Li Li (李丽)
Discipline code: 080903
Degree-granting institution: Nanjing University
Thesis submission date: 2012-05-01

Abstract

Heterogeneous multi-core design is the mainstream trend in today's multi-core processors. Its key idea is that one (or a few) general-purpose cores handle task scheduling, while dedicated high-performance computing cores carry out the main computing tasks (such as floating-point operations, signal processing, and image processing), which greatly improves the processor's execution efficiency and performance. Many factors affect the performance of a heterogeneous multi-core processor; the most important are the architecture and the performance of the computing cores. This thesis describes a heterogeneous multi-core processing chip in detail. The chip uses a NoC (Network on Chip) as its top-level architecture and integrates 52 heterogeneous cores, including ARM processors, coprocessors, FFT/IFFT accelerators, and matrix-transpose accelerators. An FPGA implementation of the chip shows that it meets the real-time requirements of the real-time imaging algorithm and produces good imaging results.
     Building on the original design of this heterogeneous multi-core processing chip, this thesis optimizes three of its key technical points.
     For the NI (Network Interface), this thesis proposes a design method based on a microcode controller and implements an NI that supports three link communication protocols simultaneously. Because it is programmable through microcode, the NI is highly flexible and adaptable. Compared with the traditional FSM (Finite State Machine) based NI design, the new design consumes about 10% fewer hardware resources.
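The contrast between a microcode controller and a hard-wired FSM can be pictured with a small table-driven sequencer. The sketch below is purely illustrative: the abstract does not name the three link protocols or describe the micro-instruction format, so the protocol names (`handshake`, `credit`, `streaming`), the action names, and the `MicroOp` layout are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MicroOp:
    action: str   # hypothetical datapath action triggered by this micro-instruction
    next_pc: int  # address of the next micro-instruction; -1 ends the program


# Hypothetical microprograms, one table per link protocol.
# Supporting another protocol means adding a table, not redesigning a state machine.
MICROCODE = {
    "handshake": [
        MicroOp("assert_request", 1),
        MicroOp("wait_ack", 2),
        MicroOp("send_flit", -1),
    ],
    "credit": [
        MicroOp("check_credit", 1),
        MicroOp("send_flit", 2),
        MicroOp("decrement_credit", -1),
    ],
    "streaming": [
        MicroOp("send_flit", -1),
    ],
}


def run(protocol):
    """Step through the microprogram for one transfer and return the action trace."""
    program = MICROCODE[protocol]
    pc = 0
    trace = []
    while pc != -1:
        op = program[pc]
        trace.append(op.action)
        pc = op.next_pc
    return trace
```

Calling `run("credit")` walks the second table. The sequencer loop itself is shared by all three protocols, which is one plausible reason a programmable controller could need less logic than three separate state machines.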
     For the Sin/Cos computing module, this thesis theoretically analyzes the error of the original design and proposes a method that improves phase precision by compensating the remainder (modulo) operation. Based on this method, a high-precision Sin/Cos computing module is designed that significantly improves the accuracy of the computed sine and cosine values. To save hardware resources, the improved design also optimizes the representation format of intermediate data. Logic synthesis results show that hardware resource consumption is reduced by about 32%.
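The abstract does not spell out how the remainder is compensated. As a sketch of the general idea, and not necessarily the thesis's scheme, the code below reduces a fixed-point phase modulo a quantized 2π and then corrects the remainder with the part of 2π that the quantized constant misses. The bit widths `FRAC_BITS` and `EXTRA_BITS` and all function names are assumptions chosen for illustration.

```python
import math

FRAC_BITS = 12           # assumed fractional bits of the fixed-point phase
SCALE = 1 << FRAC_BITS
EXTRA_BITS = 16          # assumed extra precision kept for the compensation constant

# Coarse 2*pi in phase LSBs, and the residual it misses (in LSB / 2**EXTRA_BITS).
TWO_PI_Q = round(2 * math.pi * SCALE)
TWO_PI_RESID = round((2 * math.pi * SCALE - TWO_PI_Q) * (1 << EXTRA_BITS))


def reduce_naive(phase_q):
    """Remainder against the quantized 2*pi; the error grows with the number of turns."""
    return phase_q % TWO_PI_Q


def reduce_compensated(phase_q):
    """Remainder corrected by turns * (2*pi - quantized 2*pi), rounded to one LSB."""
    turns, rem = divmod(phase_q, TWO_PI_Q)
    correction = (turns * TWO_PI_RESID + (1 << (EXTRA_BITS - 1))) >> EXTRA_BITS
    return (rem - correction) % TWO_PI_Q


def sin_error(reducer, phase_q):
    """Absolute sine error caused by the reduction step alone."""
    return abs(math.sin(reducer(phase_q) / SCALE) - math.sin(phase_q / SCALE))


# Example phase: 1000 full turns plus 1 radian, quantized to FRAC_BITS.
phase_q = round((1000 * 2 * math.pi + 1.0) * SCALE)
naive_err = sin_error(reduce_naive, phase_q)
comp_err = sin_error(reduce_compensated, phase_q)
```

With these assumed widths, the naive remainder drifts by a fraction of an LSB per turn, so after many turns the phase error dominates the sine error; the compensated version removes that accumulated drift at the cost of one extra multiply and shift.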
     For the matrix-transpose accelerator, this thesis discusses how to transpose large matrices in a distributed memory system and presents an improved design of the Transpose Cluster (which contains the matrix-transpose accelerator). Theoretical analysis and experimental results both indicate that the new design greatly increases transposition speed while reducing hardware resource consumption by about 15%. Many factors affect transposition efficiency, such as the size and shape of the matrix, the way a large matrix is split into smaller ones, and the buffer size. Grouped experiments measure the influence of each of these factors and provide a reference for using the Transpose Cluster efficiently.
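Splitting a large matrix into pieces that fit a limited buffer is commonly done with tiles: each tile is transposed on its own and written to the mirrored tile position in the result. The following is a minimal software sketch of that generic approach, not the Transpose Cluster's actual dataflow; the function name and the `tile` parameter (standing in for the buffer size) are assumptions.

```python
def transpose_blocked(matrix, tile):
    """Transpose a rows x cols matrix tile by tile.

    `tile` is an assumed stand-in for the buffer size: at most tile x tile
    elements are handled per step, as if one accelerator pass held one tile.
    """
    rows, cols = len(matrix), len(matrix[0])
    out = [[None] * rows for _ in range(cols)]
    for r0 in range(0, rows, tile):
        for c0 in range(0, cols, tile):
            # Tile at (r0, c0) in the input lands at (c0, r0) in the output.
            for r in range(r0, min(r0 + tile, rows)):
                for c in range(c0, min(c0 + tile, cols)):
                    out[c][r] = matrix[r][c]
    return out


example = transpose_blocked([[1, 2, 3], [4, 5, 6]], tile=2)
```

Non-square matrices and tile sizes that do not divide the dimensions are handled by clipping the last tile in each direction, which is one reason matrix shape and split method can affect efficiency in a real system.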
