Study and Implementation of Service Oriented Heterogeneous Multi Processor System-on-Chip

Original Chinese title: 面向服务的异构多核片上系统的关键技术研究及实现
Author: Feng Xiaojing (冯晓静)
Degree level: Doctoral
Discipline: Computer Architecture
Keywords: heterogeneous Multi-Processor System-on-Chip (MPSoC); Service-Oriented heterogeneous Multi-Processor (SOMP) architecture; Dynamic Partial Reconfiguration (DPR); Amdahl's law; task level parallelism; dynamic scheduling scheme; Colored Petri Nets (CPN); formal modeling methods
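Degree year: 2013
Advisor: Zhou Xuehai (周学海)
Discipline code: 081201
Degree-granting institution: University of Science and Technology of China
Thesis submission date: 2013-05-01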

Abstract

Advances in semiconductor manufacturing technology allow a heterogeneous Multi-Processor System-on-Chip (MPSoC) to integrate multiple processor cores with different functions. A heterogeneous MPSoC can exploit the strengths of each core to accelerate several specific categories of tasks, and can thereby simultaneously satisfy the requirements of embedded applications in terms of chip area, computing performance and power consumption. For this reason, heterogeneous MPSoCs are widely used in embedded computing. However, the growing popularity of embedded applications and the continuing increase in chip integration pose increasingly severe challenges for heterogeneous MPSoC design. These challenges include: 1) how to improve the flexibility of the hardware platform so that it can adapt to the frequently changing requirements of embedded applications; 2) how to improve the utilization efficiency of the processor cores so as to fully exploit the performance potential of the system; and 3) how to build formal models of heterogeneous MPSoCs, in particular models that accurately capture their complex dynamic behavior, to help designers verify the correctness and evaluate the performance of the system hardware/software design.
     This dissertation studies each of the above problems in turn. The main work and contributions are as follows:
     1) To increase system flexibility, we combine Dynamic Partial Reconfiguration (DPR) with the Service-Oriented heterogeneous Multi-Processor (SOMP) architecture by supporting DPR in the system hardware/software design flow, the inter-module communication interface and the parallel programming model. A SOMP prototype system with DPR is implemented on a Field Programmable Gate Array (FPGA) based development board, which verifies the correctness of the proposed design method. In addition, to demonstrate the effectiveness of DPR, the reconfiguration overhead of the prototype system and the resource overhead introduced by DPR are evaluated experimentally.
     2) This dissertation extends Amdahl's law and applies it to heterogeneous MPSoCs. Based on the extension, we quantitatively analyze how factors such as the processor core configuration, the proportion of parallelizable tasks and the task partitioning strategy affect system performance, and we derive the theoretical conditions under which the system achieves its maximum performance. The extension provides guidance for optimizing heterogeneous MPSoC designs and improving system utilization efficiency.
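     The kind of analysis described in 2) can be illustrated with a minimal sketch. The model below is an assumption made for illustration and is not the extension derived in the dissertation: the serial part runs on the fastest core, the parallel part is split across all cores according to `shares`, and speedup is measured against a single reference core of speed 1.0. The names `hetero_speedup`, `core_speeds` and `shares` are hypothetical.

```python
def hetero_speedup(parallel_fraction, core_speeds, shares=None):
    """Speedup over one reference core (speed 1.0) in a simple Amdahl-style model.

    Assumed model (illustrative only): the serial part runs on the fastest
    core; core i receives the fraction shares[i] of the parallel work and
    the parallel phase ends when the slowest-finishing core is done.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must lie in [0, 1]")
    if not core_speeds or min(core_speeds) <= 0:
        raise ValueError("core_speeds must be non-empty and positive")
    if shares is None:
        # Default partitioning: work proportional to core speed, so that
        # every core finishes its share at the same time.
        total = sum(core_speeds)
        shares = [s / total for s in core_speeds]
    if len(shares) != len(core_speeds) or abs(sum(shares) - 1.0) > 1e-9:
        raise ValueError("shares must match core_speeds and sum to 1")

    serial_time = (1.0 - parallel_fraction) / max(core_speeds)
    parallel_time = max(parallel_fraction * w / s
                        for w, s in zip(shares, core_speeds))
    return 1.0 / (serial_time + parallel_time)


# One fast core (4x) plus two reference cores, 80% parallelizable work.
proportional = hetero_speedup(0.8, [4.0, 1.0, 1.0])
equal_split = hetero_speedup(0.8, [4.0, 1.0, 1.0], shares=[1/3, 1/3, 1/3])
```

     Under this assumed model, splitting the parallel work in proportion to core speed gives a higher speedup than an equal split, because an equal split leaves the fast core idle while the slow cores finish. This shows in miniature how the task partitioning strategy and the core configuration both enter the result.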
     3) Fully exploiting the task level parallelism of applications is an effective way to improve the utilization efficiency of multicore systems. To this end, this dissertation studies automatic task parallelization for heterogeneous MPSoCs. By regarding tasks as abstract instructions, we extend the instruction level scoreboarding algorithm to heterogeneous MPSoCs for the first time and propose a dynamic scheduling scheme named Task Level Scoreboarding (TLS). TLS dynamically detects inter-task dependencies and automatically dispatches tasks to different cores for parallel execution, thereby increasing the task level parallelism of applications. TLS is implemented on the SOMP prototype system and compared experimentally with other state-of-the-art dynamic scheduling schemes. The comparison shows that TLS incurs lower runtime scheduling overhead.
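     The idea in 3) of treating tasks as abstract instructions can be sketched as follows. This is a simplified illustration under stated assumptions, not the TLS implementation from the dissertation: each task declares the data it reads and writes, a task may start once no earlier unfinished task has a RAW, WAW or WAR conflict with it, and a free core of the required type must be available. The names `Task`, `conflicts`, `schedule` and `demo_tasks` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    core_type: str      # kind of core the task must run on
    reads: frozenset    # data items the task reads
    writes: frozenset   # data items the task writes
    duration: int       # execution time in abstract time units


def conflicts(earlier, later):
    """True if `later` must wait for `earlier` to finish."""
    return bool(
        earlier.writes & later.reads       # read after write (RAW)
        or earlier.writes & later.writes   # write after write (WAW)
        or earlier.reads & later.writes    # write after read (WAR)
    )


def schedule(tasks, cores):
    """Simulate dependency-aware dispatch; `cores` maps core type to count.

    Returns {task name: (start time, finish time)}.
    """
    free = dict(cores)
    pending = list(range(len(tasks)))   # task indices in program order
    running = {}                        # task index -> finish time
    done = set()
    times = {}
    now = 0
    while pending or running:
        # Retire finished tasks and release their cores.
        for i in [i for i, finish in running.items() if finish <= now]:
            free[tasks[i].core_type] += 1
            done.add(i)
            del running[i]
        # Issue every pending task that has no hazard and a free core.
        for i in list(pending):
            task = tasks[i]
            blocked = any(j not in done and conflicts(tasks[j], task)
                          for j in range(i))
            if not blocked and free.get(task.core_type, 0) > 0:
                free[task.core_type] -= 1
                running[i] = now + task.duration
                times[task.name] = (now, now + task.duration)
                pending.remove(i)
        if running:
            now = min(running.values())   # jump to the next completion
        elif pending:
            raise ValueError("no core available for the remaining tasks")
    return times


demo_tasks = [
    Task("A", "dsp", frozenset({"x"}), frozenset({"y"}), 4),
    Task("B", "cpu", frozenset({"y"}), frozenset({"z"}), 2),  # needs A's y
    Task("C", "cpu", frozenset({"x"}), frozenset({"w"}), 3),  # independent
]
demo_times = schedule(demo_tasks, {"dsp": 1, "cpu": 1})
```

     In this example task C starts at time 0 alongside A, ahead of the earlier task B, which waits for A to produce `y`. The program order is therefore relaxed only where the declared read and write sets show that it is safe to do so.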
     4) This dissertation models heterogeneous MPSoCs using a formal modeling language, Colored Petri Nets (CPN). Unlike previous work, the proposed CPN model describes not only the static elements of a system, such as processor cores, memory units and tasks, but also dynamic behavior during execution, such as inter-task dependency detection and task dispatching. The model can therefore be used to model the task execution process of a system and the corresponding scheduling algorithm. With the model, designers can evaluate the performance of heterogeneous MPSoC hardware/software designs at an early stage. This dissertation uses the model to simulate the SOMP prototype system and its out-of-order task execution. By comparing the simulation results with the actual execution of the prototype system, we show that the proposed CPN model accurately simulates the execution of heterogeneous MPSoCs.
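     How a colored Petri net can express task dispatching, as described in 4), is sketched below. The net is a toy example assumed for illustration and is not the model built in the dissertation: places hold multisets of colored tokens, a `dispatch` transition is guarded by a match between the core type a task needs and the type of an idle core, and a `complete` transition returns the core to the idle place. Dependency detection and timing are omitted. All place, transition and function names are hypothetical.

```python
from collections import Counter


def dispatch_bindings(marking):
    """All (task, core) token pairs that satisfy the guard of `dispatch`."""
    tasks = sorted(set(marking["ready"].elements()))
    idle_cores = sorted(set(marking["idle"].elements()))
    return [(task, core) for task in tasks for core in idle_cores
            if task[1] == core[1]]          # guard: required type == core type


def fire_dispatch(marking):
    """Consume one ready task and one matching idle core, produce a busy token."""
    bindings = dispatch_bindings(marking)
    if not bindings:
        return None
    (name, kind), (core_id, _) = bindings[0]
    marking["ready"][(name, kind)] -= 1
    marking["idle"][(core_id, kind)] -= 1
    marking["busy"][(name, core_id, kind)] += 1
    return ("dispatch", name, core_id)


def fire_complete(marking):
    """Consume one busy token, produce a done token and free the core."""
    busy = sorted(set(marking["busy"].elements()))
    if not busy:
        return None
    name, core_id, kind = busy[0]
    marking["busy"][(name, core_id, kind)] -= 1
    marking["done"][name] += 1
    marking["idle"][(core_id, kind)] += 1
    return ("complete", name, core_id)


def run(marking):
    """Fire enabled transitions until the net is dead; return the firing trace."""
    trace = []
    while True:
        step = fire_dispatch(marking) or fire_complete(marking)
        if step is None:
            return trace
        trace.append(step)


# Initial marking: three tasks colored by required core type, two idle cores.
demo_marking = {
    "ready": Counter({("fft", "dsp"): 1, ("idct", "dsp"): 1, ("ctrl", "cpu"): 1}),
    "idle": Counter({("c0", "cpu"): 1, ("d0", "dsp"): 1}),
    "busy": Counter(),
    "done": Counter(),
}
demo_trace = run(demo_marking)
```

     The token colors carry the information that a plain Petri net would need separate places to encode, which is why one `dispatch` transition can serve every core type. A model intended for performance evaluation would also attach time to the tokens and add the dependency checks that this sketch leaves out.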
