Research on Several Frequently-used Algorithms and Their Implementation for Reconfigurable System

Original title (Chinese): 面向可重构系统的几个常用算法及其实现技术研究
Author: 牟胜梅
Degree level: Doctoral
Discipline: Computer Science and Technology
Keywords: FPGA ; Elementary function ; CORDIC algorithm ; Function evaluation ; Lookup table ; Polynomial approximation ; Vector rotation ; Fast Fourier transform
Degree year: 2008
Advisor: 杨晓东
Discipline code: 081201
Degree-granting institution: 国防科学技术大学 (National University of Defense Technology)
Thesis submission date: 2008-10-01

Abstract

Hardware implementations of algorithms can be several orders of magnitude faster than software implementations. Reconfigurable devices such as field-programmable gate arrays (FPGAs) are a leading choice for hardware acceleration because of their high speed, low power consumption, and programming flexibility. As integrated-circuit technology advances, the capacity and performance of FPGAs keep growing, and they are increasingly used in digital signal processing, computer technology, and scientific computing.
     With the goals of reducing latency, optimizing resource utilization, and increasing throughput, this thesis studies algorithms that are used frequently in digital signal processing and scientific computing, together with their implementation on reconfigurable platforms: evaluation of elementary functions and their compositions, vector rotation, and the fast Fourier transform (FFT). It combines approximation algorithms, table-driven evaluation, pipelining, and memory-access scheduling, and proposes several new methods. The main contributions are:
     1) Evaluation of exponential and logarithm functions. First, we simplify the conventional CORDIC algorithm by merging the x and y datapaths into one, which yields a dedicated exponential evaluator with a smaller area. Second, we propose LnE, a unified iterative algorithm for exponential and logarithm evaluation: once the initial values and the function-select signal are set appropriately, it computes either function. LnE dispenses with the z datapath of the conventional CORDIC algorithm and saves more than 1/3 of the area. Third, for applications that need high speed but only low accuracy, we design and implement a fast binary logarithm converter based on an approximation algorithm, with simple and efficient correction logic to improve its accuracy.
     2) Table-based function evaluation. We implement an efficient multi-function lookup-table evaluation system that approximates each function segment by segment with second-order minimax polynomials. The segmentation strategy is chosen to balance the complexity of segment-address decoding against the memory cost of the lookup tables. Separate fixed-point and floating-point datapaths evaluate nearly linear and highly nonlinear functions, respectively, to guarantee accuracy. A three-pass approximation process that rounds the coefficients one at a time accounts for the finite precision of the polynomial coefficients and allows a further reduction in table size. Accuracy control and truncation of intermediate results reduce the area and delay of the polynomial computation. Finally, memory space is allocated among the lookup tables of the different functions to integrate the system.
     3) Vector rotation. We consider two classes of applications: those whose rotation angles span a wide range and are not known in advance, and those with a limited set of rotation angles known in advance. For these we propose the 2S-PCS and FFT CORDIC algorithms, which aim to reduce the number of iterations and the area. Both use FPGA on-chip memory as lookup tables to assist the computation. They reduce the iteration count while keeping the scale factor as easy to compute and compensate as in the conventional CORDIC algorithm, which gives them an advantage over multiply/add methods. With a 28-bit datapath, 2S-PCS needs about 38% fewer pipeline stages and about 27.9% less area than the conventional CORDIC algorithm, and it is about 3 bits more accurate. Finally, we optimize the CORDIC algorithm for two special FFT applications.
     4) FFT processors. Based on the radix-2 decimation-in-time FFT algorithm, we analyze the read-after-write data dependencies in floating-point FFT processors and propose several ways to reduce them. By modifying the butterfly structure and scheduling accesses to FPGA on-chip RAM, we design and implement a high-throughput 32-bit IEEE 754 single-precision FFT processor that reads two complex operands and outputs two complex results per cycle, doubling the throughput of a traditional FFT processor. For applications in which the transform length is not fixed, we also design and implement a variable-length fixed-point FFT processor based on cascaded butterflies.
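
As an illustration of the radix-2 decimation-in-time algorithm that contribution 4) starts from, the following is a minimal software sketch of the butterfly recursion. It is a textbook formulation, not the thesis's hardware design; the function names `fft_radix2_dit` and `dft` are chosen here for illustration only, and `dft` is a naive reference used to check the result.

```python
import cmath

def fft_radix2_dit(x):
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a nonzero power of two")
    if n == 1:
        return [complex(x[0])]
    # Split into even- and odd-indexed samples and transform each half.
    even = fft_radix2_dit(x[0::2])
    odd = fft_radix2_dit(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        # Twiddle factor W_n^k = exp(-2*pi*i*k/n).
        w = cmath.exp(-2j * cmath.pi * k / n)
        t = w * odd[k]
        # One butterfly: two complex inputs produce two complex outputs.
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def dft(x):
    """Naive O(n^2) DFT, used only as a reference for checking."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]
```

Each butterfly consumes two complex operands and produces two complex results, which is the unit of work that the per-cycle throughput figure in contribution 4) refers to.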
     The implementation methods studied in this thesis are general: they apply to reconfigurable platforms and, with minor adjustments, to ASIC design. Because the algorithms are fine-grained, they are best combined with other operations in larger applications to achieve higher speedups. If they are accelerated in hardware on their own, the overhead of data input/output between hardware and software must also be taken into account.
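
For readers unfamiliar with the conventional CORDIC algorithm that contributions 1) and 3) use as their baseline, the following is a minimal floating-point sketch of circular-mode vector rotation. It shows the shift-and-add iteration and the constant scale factor that must be compensated; it is not the 2S-PCS, FFT CORDIC, or LnE algorithm. The function name `cordic_rotate` and the default of 32 iterations are assumptions made for this example.

```python
import math

def cordic_rotate(x, y, angle, iterations=32):
    """Rotate (x, y) by `angle` radians with conventional circular CORDIC.

    The iteration converges for |angle| up to roughly 1.74 rad.
    """
    # Elementary angles atan(2^-i); in hardware these sit in a small table.
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]
    # Every micro-rotation stretches the vector, so the total gain is a
    # constant that depends only on the number of iterations.
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))
    z = angle
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0   # rotate toward a zero residual angle
        shift = 2.0 ** -i             # a right shift by i bits in hardware
        x, y = x - d * y * shift, y + d * x * shift
        z -= d * angles[i]
    # Compensate the scale factor once at the end.
    return x / gain, y / gain
```

A call such as `cordic_rotate(1.0, 0.0, 0.5)` returns values close to `(math.cos(0.5), math.sin(0.5))`. In a pipelined implementation each loop iteration corresponds to one stage, which is why reducing the iteration count also reduces the number of pipeline stages and the area.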

