面向多级SPM存储的并行程序优化

英文题名：Multi-level SPM Based Parallel Programme Optimization
副题名：任务与数据的关联调度
英文副题名：Associated Scheduling of Task and Data
作者：任小广
论文级别：硕士
学科专业名称：计算机科学与技术
中文关键词：片上多核 ; 便笺存储器 ; 任务调度 ; 数据分配 ; 关联调度
英文关键词：Multi一Proeessor System on ChiP ; SeratehPad Memory ; TaskSeheduling ; Data Partition ; Assoeiated Seheduling
学位年度：2010
导师：唐玉华
学科代码：081201
学位授予单位：国防科学技术大学
论文提交日期：2010-11-01

摘要

以Cache为代表的硬件管理存储层次作为解决“存储墙”问题的重要手段,在出现之初便获得了巨大成功,但随着技术和工艺的发展,逐渐暴露出诸多问题和缺陷,具体表现在性能、功耗、片上面积等方面。针对这些问题和缺陷,人们提出了软件管理存储层次方案,并进行了广泛研究。便笺存储器(SPM)作为一种软件管理片上存储器,在性能、功耗、片上面积等方面都比硬件管理的Cache有着更多的优势。
     具有多级SPM存储层次的片上多核体系结构对于片上多核系统的性能提升有着重要意义。本文针对多级SPM存储层次下片上多核体系结构中的并行程序优化方法进行了研究,提出了三种启发式任务与数据关联调度算法。首先是局部贪婪LGA关联调度算法,该算法在对任务进行表调度的基础上,对数据采用局部贪婪策略进行分配。针对LGA算法中局部贪婪策略带来的缺陷,本文又提出了全局预测GPA关联调度算法。同时,为开发循环迭代间的并行性,本文基于循环依赖重组技术,提出了旋转调度分配RPSA关联调度算法。针对上述三种算法,本文进行了编译实现,为面向多级SPM的并行程序优化提供了自动处理平台。
     同时,体系结构软件模拟技术能够为多级SPM存储层次的片上多核体系结构研究提供重要的模拟平台,也是本文任务与数据关联调度算法研究的必要验证手段。为此,本文基于SimpleScalar模拟器开发了多级SPM存储层次的片上多核系统模拟平台Sim-SPM。在Sim-SPM模拟器的设计和实现过程中,成功解决了多核环境、虚拟SPM空间、软件存储管理指令扩展等诸多问题。对现有指令集的兼容性设计使得Sim-SPM模拟器更具实用性。针对模拟器的模拟速度和模拟精度问题,本文进行了详细测试验证,实验结果表明Sim-SPM模拟器能够满足多级SPM存储层次下片上多核系统的模拟验证需求。
     最后,基于Sim-SPM模拟平台,选取多个典型测试程序对本文提出的三个算法从多个角度进行了详细测试和验证,包括初步性能对比、与近似最优方案比较、目标体系结构对算法性能影响等。实验结果表明,本文提出的一系列任务与数据关联调度算法对于多级SPM片上多核系统下的并行程序优化研究是具有指导意义的。
The hardware-managed memory hierarchy, which is represented by cache memory hierarchy, has gain great success as a solution to the "Memory Wall" problem. However, with the development of technics and technology, lots of limitations and defects of the problem emerge gradually, specifically in the aspects of performance, power consumption, chip area and so on. To solve these problems, the software-managed memory hierarchy has been proposed, which has attracted many researchers’attention. Scratchpad Memory (SPM), as a software-managed on-chip memory, shows more advantages than hardware management based cache in performance, power consumption, chip area and other aspects.
     Multi-level scratchpad memory hierarchy has great significance for the performance of on-chip multi-processor. The paper mostly focuses on the study of parallel optimization of the on-chip multi-processor system with multi-level scratchpad memory hierarchy, and proposed three algorithms with heuristic policy for associated scheduling of task and data. The first algorithm is a local greedy scheduling algorithm (LGA). The algorithm gets the data partition based on the list scheduling of tasks with local greedy policy. To remedy the defects caused by the local greedy policy in LGA, a second global prediction scheduling algorithm (GPA) was introduced. The paper also researched a rotation scheduling algorithm for the development of inter-iteration parallelism, which is based on the dependent retiming technology. The author implemented the algorithm in a compiler, which automatically carries out the parallel optimization method.
     Meanwhile, the software architectural simulation technology is important to the research of multi-level scratchpad memory hierarchy, which serves as a means of validation of the algorithms proposed in this paper. The author developed Sim-SPM, a simulator for the research of multi-processor on chip with multi-level scratchpad memory hierarchy, which is based on the SimpleScalar simulator. During the design and development process, the author successfully resolved several obstacles of implementing the Sim-SPM simulator, such as the multi-processor environment simulation, extensions of memory management instructions and the virtual memory space simulation of scratchpad memory. The compatibility of the existing instruction set in Sim-SPM makes the simulator quite practicable. To attain simulation speed and accuracy of the simulator, the author carried out a detailed evaluation test and validated that the Sim-SPM simulator can meet very well the verification needs of the on-chip multi-processor with multi-level scratchpad memory hierarchy.
     Finally, the author tested the three algorithms proposed in this paper with several typical benchmarks on the Sim-SPM simulator. Several aspects had been verified through the experiment, including a preliminary performance comparison, comparion with the approximate optimal solution, the influence of target architecture on algorithm performance and so on. The experimental results show that the proposed algorithms for task scheduling and data partition have a guiding significance for the software management of multi-scratchpad memory hierarchy on the multi-processor system on-chip.

引文

[1] John L. Hennessy and David A. Patterson. Computer Architecture: A Quanti-tative Approach. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2007.
    [2] Keith Boland and Apostolos Dollas. Predicting and precluding problems with memory latency. IEEE Micro, 14(4):59-67, 1994.
    [3] Philip Machanick. Approaches to Addressing the Memory Wall. Technicalreport, School of IT and Electrical Engineering, University of Queensland, 2002.
    [4] Wm. A. Wulf and Sally A. McKee. Hitting the memory wall: implications of the obvious. SIGARCH Comput. Archit. News, 23(1):20-24, 1995.
    [5] Peter J. Denning. The locality principle. Communications of the ACM, 48(7):19-24, 2005.
    [6] David A. Patterson , John L. Hennessy, Computer architecture: a quantitative approach, Morgan Kaufmann Publishers Inc., San Francisco, CA, 1990
    [7] Steven Przybylski. Cache and Memory Hierarchy Design: A Performance Directed Approach. 1990.
    [8] Wei fen Lin, Steven K. Reinhardt, and Doug Burger. Reducing dram latencies with an integrated memory hierarchy design. In HPCA '01: In Proceedings of the 7th International Symposium on High Performance Computer Architecture, 2001.
    [9] Ruud van der Pas. Memory herarchy in Cache-based systems. Technical report, 2002.
    [10] A.J.Smith. Cache memories. ACM Computing Surveys, 14(3):473-530, 1982.
    [11] Goodman. Using Cache memory to reduce processor/memory trace. In In Proceedings of the Tenth International Symposium on Computer Architecture, pages 124-131, 1983.
    [12] Rajeshwari Banakar, Stefan Steinke, Bo-Sik Lee, M. Balakrishnan, and PeterMarwedel. Scratchpad memory: design alternative for Cache on-chip memory in embedded systems. In CODES '02: Proceedings of the tenth internationalsymposium on Hardware/software codesign, pages 73-78, New York, NY, USA, 2002. ACM.
    [13] M. B. Kamble and K. Ghose. Analytical energy dissipation models for low power caches. In Proceedings of International Symposium on Low Power Electronics and Design, pages 143-148, 1997.
    [14] Peter Marwedel, Lars Wehmeyer, Manish Verma, Stefan Steinke, and UrsHelmig. Fast, predictable and low energy memory referencesthrougharchitecture-aware compilation. In Proceedings of the 2004 Asia and South Pacic Design Automation Conference, pages 4-11, 2004.
    [15] P. Marwedel and L. Wehmeyer. Influence of memory hierarchies on predictability for time constrained embedded software. In Proceedings of the conference on Design, Automation and Test in Europe, pages 600-605, 2005.
    [16] J.D. Gee and A.J. Smith. The performance impact of vector processor cashes. In Proceedings of the Twenty-Fifth Hawaii International Conference on System Sciences.
    [17] Scott Rixner, William J. Dally, Ujval J. Kapasi, Brucek Khailany, Abelardo López-Lagunas, Peter R. Mattson, and John D. Owens. Bandwidth-effcient architecture for media processing. In Proceedings of the 31st Annual International Symposium on Microarchitecture, pages 3-13, 1998.
    [18] Peter Raymond Mattson. A programming system for the imagine media processor. PhD thesis, Stanford University, Stanford, CA, USA, 2002. Adviser William J. Dally.
    [19] William J. Dally, Ujval J. Kapasi, Brucek Khailany, Jung Ho Ahn, and Abhishek Das. Stream processors: Programmability and effenciency. Queue, 2(1):52-62, 2004.
    [20] Swagato Basumallick and Kelvin Nilsen. Cache issues in real-time systems, 1994.
    [21] Richard M. Russell. Communications of the ACM, 21(1):63-72, 1978.
    [22] Srinivas K. Raman, Vladimir Pentkovski, and Jagannath Keshava. Implementing streaming simd extensions on the pentium iii processor. IEEE Micro, 20(4):47-57, 2000.
    [23] Intel. Intel Pentium Processor with MMX Technology Documentation, http://www. intel.com/design/archives/processors/mmx.
    [24] Power ISA v.2.03, http://www.power.org/resources/downloads/powerisa203.pu blic .pdf.
    [25] John D. Owens, Ujval J. Kapasi, Peter Mattson, Brian Towles, Ben Serebrin, Scott Rixner, and William J. Dally. Media processing applications on the imagine stream processor. In Proceedings of the IEEE International Conference on Computer Design, pages 295-302, September 2002.
    [26] Xuejun Yang, Xiaobo Yan, Zuocheng Xing, Yu Deng, Jiang Jiang, and Ying Zhang,A 64-bit stream processor architecture for scientic applications. In ISCA '07: Proceedings of the 34th annual international symposium on Com-puter architecture, pages 210-219, ACM, 2007.
    [27] D. Pham, S. Asano, M. Bolliger, M. N. Day, H. P. Hofstee, C. Johns, J. Kahle,A. Kameyama, J. Keaty, Y. Masubuchi, M. Riley, D. Shippy, D. Stasiak,M.Suzuoki, M. Wang, J. Warnock, S. Weitzel, D. Wendel, T. Yamazaki, and K.Yazawa. The design and implementation of arst-generation cell processor.In Solid-State Circuits Conference, 2005. Digest of Technical Papers. ISSCC.2005 IEEE International, pages 184-592 Vol. 1, 2005.
    [28] Koch K. the new roadrunner supercomputer: What, when, and how. In Pro ceedings of International Conference on High Performance Computing, 2006.
    [29] J. Makino, K. Hiraki, and M. Inaba. Grape-dr: 2-pops massively-parallel computer with 512-core, 512-groups processor chips for scientiˉc computing. In In SC '07: Proceedings of the 2007 ACM/IEEE conference on Supercomputing,page 1–11, ACM, 2007.
    [30] William J. Dally, Francois Labonte, Abhishek Das, Patrick Hanrahan, and Jung-Ho Ahn et al. Merrimac: Supercomputing with streams. In SC '03: Proceedings of the 2003 ACM/IEEE conference on Supercomputing, page 35.IEEE Computer Society, 2003.
    [31] Mattan Erez, Jung Ho Ahn, Ankit Garg, William J. Dally, and Eric Darve.Analysis and performance results of a molecular modeling application on merrimac. In SC '04: Proceedings of the 2004 ACM/IEEE conference on Supercomputing, page 42, Washington, DC, USA, 2004, IEEE Computer Society.
    [32] NVIDIA. CUDA, http://developer.nvidia.com/object/cuda.html.
    [33] Shane Ryoo, Christopher I. Rodrigues, Sam S. Stone, Sara S. Baghsorkhi, SainZee Ueng, John A. Stratton, and Wen-mei W. Hwu. Program optimization space pruning for a multithreaded gpu. In CGO'08: Proceedings of the sixthannual IEEE/ACM international symposium on Code Generation and optimization, pages 195-204, New York, NY, USA, 2008, ACM.
    [34] Muthu Manikandan Baskaran, Uday Bondhugula, Sriram Krishnamoorthy, J.Ramanujam, Atanas Rountev, and P. Sadayappan. A compiler framework foroptimization of loop nests for gpgpus. In ICS'08: Proceedings of the 22nd annual international conference on Supercomputing, pages 225-234, New York, NY, USA, 2008, ACM.
    [35] Shane Ryoo, Christopher I. Rodrigues, Sara S. Baghsorikhi, Sam S. Stone, David B. Kirk, and Wen-mei W. Hwu. Optimization principles and applicationperformance evaluation of a multithreaded gpu using cuda. In PPoPP'08: Proceedings of the 13th ACM SIGPLAN Symposium on principles and practice of parallel programming, pages 73-82, New York, NY, USA, 2008, ACM.
    [36] Motorola/Freescale. MPC500 32-bit MCU Family, http://www.freescale.com /les/microcontrollers/doc/factsheet/mpc500fact.pdf.
    [37] Philips. LPC2290 16/32-bit Embedded CPU, http://www.semiconductors. philips.com/acrobatdownload/datasheets/lpc2290-01.pdf, 2004.
    [38] Intel. Intel XScale(R) PXA27x Processor Family, http://www.intel.com /design/pc a/probref/253820.htm.
    [39] Marvell. Marvell PXA320 Processor Series Brief, http://www.marvell.com /products/ce llular/application/pxa320 pbr4.pdf.
    [40] Motorola/Freescale. Dragonball MC68SZ328 32-bit Embedded CPU, http://www. freescale.com/les/32bit/doc/fact sheet/mc68sz328fs.pdf.
    [41] Motorola. Motorola ColdFire MCF5 processor family, http://e-www. motorola.com.
    [42] ARM. ARM10E Family, http://www.arm.com/products/cpus/families/arm 10efamily. html.
    [43] Motorola. CPU12 Reference Manual, http://e-www.motorola.com/brdata/ pdfdb/microcontrollers/16bit/68hc12family/refmat/cpu12rm.pdf.
    [44] Motorola. M-CORE-MMC2001 Reference Manual, http://www.motorola. com/sps/ mcore/infodocumentation.htm.
    [45] Preeti Ranjan Panda, Nikil D. Dutt, and Alexandru Nicolau. E±cient uti-lization of scratch-pad memory in embedded processor applications. In EDTC'97: Proceedings of the 1997 European conference on Design and Test, page 7, Washington, DC, USA, 1997, IEEE Computer Society.
    [46] Jan Sjodin, Bo Froderberg, and Lindgren Thomas. Allocation of global data objects in on-chip ram, Compiler and Architecture Support for Embedded Computing Systems, 1998.
    [47] Jan Sjodin and Carl von Platen. Storage allocation for embedded processors.In CASES '01: Proceedings of the 2001 international conference on Compilers, architecture, and synthesis for embedded systems, pages 15-23, New York, NY,USA, 2001. ACM.
    [48] Oren Avissar, Rajeev Barua, and Dave Stewart, Heterogeneous memory management for embedded systems. In CASES '01: Proceedings of the 2001 in-ternational conference on Compilers, architecture, and synthesis for embedded systems, pages 34-43, New York, NY, USA, 2001, ACM.
    [49] Oren Avissar, Rajeev Barua, and Dave Stewart, An optimal memory allocation scheme for scratch-pad-based embedded systems. ACM Trans. Embed. Comput.Syst, 1(1):6-26, 2002.
    [50] Jason D. Hiser and Jack W. Davidson. Embarc: an efecient memory bank assignment algorithm for retargetable compilers. In LCTES '04: Proceedings of the 2004 ACM SIGPLAN/SIGBED conference on Languages, compilers, and tools for embedded systems, pages 182-191, New York, NY, USA, 2004, ACM.
    [51] Erik G. Hallnor and Steven K. Reinhardt, A fully associative software-managed cache design. SIGARCH Comput, Archit. News, 28(2):107-116, 2000.
    [52] Csaba Andras Moritz, Matthew Frank, and Saman P. Amarasinghe. Flexcache: A framework forexible compiler generated data caching. In IMS '00: Revised Papers from the Second International Workshop on Intelligent MemorySystems, pages 135-146, London, UK, 2001, Springer-Verlag.
    [53] M. Kandemir, J. Ramanujam, J. Irwin, N. Vijaykrishnan, I. Kadayif, and A.Parikh. Dynamic management of scratch-pad memory space. In DAC '01: Proceedings of the 38th conference on Design automation, pages 690-695, New York, NY, USA, 2001, ACM.
    [54] Sumesh Udayakumaran and Rajeev Barua. Compiler-decided dynamic memory allocation for scratch-pad based embedded systems. In CASES '03: Proceedings of the 2003 international conference on Compilers, architecture and synthesis for embedded systems, pages 276-286, New York, NY, USA, 2003, ACM.
    [55] Lian Li, Lin Gao, and Jingling Xue. Memory coloring: A compiler approach for scratchpad memory management. In PACT '05: Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques, pages 329-338, 2005.
    [56] Lian Li, Quan Hoang Nguyen, and Jingling Xue. Scratchpad allocation for data aggregates in superperfect graphs. In Proceedings of the 2007 ACM SIGPLAN/SIGBED conference on Languages, compilers, and tools for embedded systems, pages 207-216, ACM, 2007.
    [57] Michael D. Smith, Norman Ramsey, and Glenn Holloway, A generalized algorithm for graph-coloring register allocation. In PLDI '04: Proceedings of the ACM SIGPLAN 2004 conference on Programming language design and implementation, pages 277-288, ACM, 2004.
    [58] Ozcan Ozturk, Mahmut Kandemir, Mary Jane Irwin, and Suleyman Tosun.Multi-level on-chip memory hierarchy design for embedded chip multiprocessors. In ICPADS '06: Proceedings of the 12th International Conference on Parallel and Distributed Systems, pages 383-390, Washington, DC, USA, 2006.IEEE Computer Society.
    [59] Kayvon Fatahalian, Daniel Reiter Horn, Timothy J. Knight, Larkhoon Leem,Mike Houston, Ji Young Park, Mattan Erez, Manman Ren, Alex Aiken, William J. Dally, and Pat Hanrahan. Sequoia: programming the memory hierarchy. In SC '06: Proceedings of the 2006 ACM/IEEE conference on Supercomputing, page 83, New York, NY, USA, 2006. ACM.
    [60] Muthu Manikandan Baskaran, Uday Bondhugula, Sriram Krishnamoorthy,J.Ramanujam, Atanas Rountev, and P. Sadayappan. Automatic data movement and computation mapping for multi-level parallel architectures with explicitly managed memories. In PPoPP '08: Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming, pages 1-10, NewYork, NY, USA, 2008, ACM.
    [61] John L. Perform ance evaluation: Techniques, tools and benchmarks. 2002. http://lea.ece.utexas.edu/pubs/jolm_perfeva1.Pdf
    [62] http://www.simplescalar.com/
    [63] Emer J, Ahuja P.Asim: A performance model framework.IEEE Computer, 2002, 35(2): 68-76
    [64] Rosenblum M, Bugnion E.Using the SimOS machine simulator to study complex computer systems.ACM Trans. on Modeling and Computer Simulation, 1997, 7(1):78-103
    [65] Roland W, Thomas W. SAMRTS: Accelerating microarchitecture simulation via rigorous statistical sampling. In: Proc. ofthe 30th Annual Int’1 Symp. on Computer Architecture(ISCA 2003).2003. httpl://www.ece.cmu.edu/~simflex /ubl/smarts2003is ca.pdf
    [66] R. Allen and K. Kennedy, Optimizing Compilers for Modern Architectures. Morgan Kaufmann, 2002.
    [67] L. Zhang, M. Qiu, W. Tseng and E. H.-M. Sha, Variable partitioning and scheduling for MPSOC with virtually shared scratch pad memory, Journal of Signal Processing Systems, pp. 1-19, 2009.
    [68] Chao, L.-F., & Sha, E. H.-M. (1997). Scheduling data-flow graphs via retiming and unfolding. IEEE Transactions on Parallel and Distributed Systems, 8(12), 1259–1267.
    [69] Chao, L.-F., LaPaugh, A. S.,&Sha, E.H.-M. (1997). Rotation scheduling: A loop pipelining algorithm. IEEE Transactins on Computer-Aided Design, 16(3), 229–239.
    [70] T. Austin“SimpleScalar Hacker"s Guide”(for tool set release 2.0), pp. [online] Available: www.simplescalar.com
    [71] Li Y, Hempstead M. Power and thermal effect of SRAM vs. 1atch—max design styles and clock gating choices. In: Proc. of the ISLPED 2005. SanDiego, 2005, http://www.cs.virginia. edu/skadron/Papers/islped05.1i.pdf
    [72] Guthaus, M. R., Ringenberg, J. S., Ernst, D., Austin, T. M., Mudge, T., & Brown R. B. (2001). Mibench: A free, commercially representative embedded benchmark suite. In WWC’01: Proceedings of the workload characterization, 2001. WWC-4. 2001 IEEE international workshop (pp. 3–14).
    [73] Vallerio, K. S., & Jha, N. K. (2003), Task graph extraction for embeddedsystem synthesis. In VLSID’03: Proceedings of the 16th international conference on VLSI design (p. 480).
    [74] Valgrind (2009). Valgrind homepage, http://www.valgrind.org.

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700