Research on the Performance and Scalability of Data-Parallel Programming Models on Multicore Platforms
Abstract
The demand for large-scale data processing has grown increasingly urgent over the past decade, with the data awaiting processing accumulating like an avalanche. According to 2007 statistics from the consulting firm IDC, the amount of data stored on electronic media had reached a staggering 161 exabytes by 2006, and was projected to reach 988 exabytes by 2010. Data-intensive applications have undoubtedly become one of the most important classes of computer applications today.
     Meanwhile, with the growing prevalence of multicore technology and the rapid increase in the number of on-chip cores, multicore platforms show extremely broad application prospects in large-scale data processing. However, the powerful computing capability delivered in multicore form can only be fully exploited through parallel programs, yielding real gains that keep pace with the growth in core count. Writing efficient parallel programs has long been a notorious difficulty for programmers: besides the business logic itself, they must confront a host of parallelism-related issues, including data distribution, scalability, load balancing, and fault tolerance. In 2008 the research firm Gartner listed seven grand challenges facing the IT market over the next 25 years, with parallel programming in the multicore era ranked second.
     Data-parallel programming models are arguably the best answer to this challenge: through suitable abstractions they hide parallelism-related issues, leaving business-logic development to application programmers while leaving the challenge of achieving parallelism to parallel-computing experts. However, most existing data-parallel programming models and their runtime support are designed and implemented for cluster platforms, without fully accounting for the characteristics of multicore platforms, such as fast inter-core communication, shared-cache contention, and a whole-machine failure model; consequently, they cannot effectively exploit the powerful computing capability that multicore technology provides. In addition, existing parallel programming models focus more on generality than on specialization, which limits their effectiveness in certain application domains and under certain computational requirements.
     Based on an in-depth analysis of the performance and scalability problems that the existing MapReduce parallel programming model exhibits on multicore platforms, this dissertation proposes a systematic solution. First, it extends the MapReduce model with a tiling (divide-and-conquer) strategy tailored to the characteristics of multicore platforms. Then, on top of this tiled MapReduce model, it proposes several optimizations targeting memory footprint, cache locality, and task parallelism. Finally, using online aggregation and incremental computation as examples, it analyzes and validates the tiled MapReduce model's efficient support for applications from different domains with different requirements. Compared with previous work, this research is devoted to designing and implementing a data-parallel programming model targeted at multicore platforms, fully utilizing the available resources to obtain commensurate performance and scalability, and efficiently supporting a wider range of domains and applications.
     Specifically, the main contributions of this dissertation are as follows:
     1. From the perspective of data-parallel programming models, we analyze in depth the main differences between multicore and cluster platforms, and on this basis reveal the principal problems that the cluster-oriented MapReduce programming model encounters on multicore platforms. We propose extending MapReduce with a tiling (divide-and-conquer) strategy that decomposes a large job into multiple sub-jobs executed iteratively, and we improve the original fault-tolerance mechanism, so as to fully adapt to the characteristics of multicore platforms.
     2. We propose several runtime optimizations based on the tiled MapReduce model, covering memory, cache, and processor. Dynamic data loading and buffer reuse reduce and shorten memory-resource occupancy; a NUCA/NUMA-aware scheduling policy improves cache locality; and software pipelining together with work stealing eliminates processor idle time.
     3. Based on the tiled MapReduce model and the associated runtime optimizations, we design and implement a prototype system named Ostrich on a multicore platform. Thorough evaluation shows that the interface extensions of the tiled MapReduce model impose no extra burden on programmers compared with other MapReduce implementations. Moreover, on a test platform built from 16-core Intel processors, the Ostrich runtime not only achieves better scalability on all benchmarks, but in the performance tests also saves up to 85% of memory, reduces the cache miss rate by 3.1x to 7.1x, and improves overall performance by 1.2x to 3.3x.
     4. Leveraging the strong support provided by the tiled MapReduce model, we design and implement two case-study applications targeting different domains and computational requirements. The Oops system supports online aggregation: during execution it can report to the user an approximate result at the current progress, and it efficiently supports multi-level online computation. The OstrichInc system performs computation reuse at the sub-job level, efficiently supporting both strict and partial incremental computation. Evaluation results show that the tiled MapReduce model, while retaining its original generality, efficiently supports a variety of applications from different domains with different requirements.
The demand for processing large-scale data has emerged and rapidly increased over the past 10 years; the dramatic growth of data is an "information avalanche". According to statistics from IDC Inc., the amount of information stored in digital form in 2006 was 161 exabytes, and it was projected to reach 988 exabytes by 2010. There is no doubt that data-intensive applications have become one of the most important application domains.
     Meanwhile, with the prevalence of multicore and the continuously increasing number of cores on a single machine, multicore platforms have become a promising alternative for large-scale data processing, offering abundant resources to harness. The most common way to harness multicore resources is through parallel applications. Unfortunately, it is a nightmare for average programmers to write an efficient parallel application: in addition to functionality, they also have to consider many issues including data distribution, scalability, load balancing and fault tolerance. In 2008, Gartner Inc. identified seven grand challenges in IT for the next 25 years, with parallel programming ranked second.
     The data-parallel programming model is an appealing answer to this challenge, providing easy-to-use abstractions that hide parallelism issues from users. However, current data-parallel programming models, designed to program large clusters, do not seriously consider several unique features of multicore platforms, such as low-latency inter-core communication, contention for shared caches, and pervasive global failures. Furthermore, current data-parallel programming models emphasize generality while ignoring specialization, which limits their usage in some application domains.
     Based on a detailed analysis of the performance and scalability problems of running the MapReduce programming model on multicore platforms, this dissertation proposes a practical and systematic solution. First, we extend the general MapReduce programming model with a "tiling strategy", called Tiled-MapReduce (TMR). Second, we propose three optimizations that aim at improving memory efficiency, data locality and task parallelism. Finally, we demonstrate the powerful support from Tiled-MapReduce for novel applications such as online aggregation and incremental computation.
     The main contents and contributions of this dissertation include the following aspects:
     ·We analyze the performance issues resulting from the differences between multicore and cluster platforms, and propose Tiled-MapReduce, which applies the "tiling strategy" to increase performance and scalability. The runtime divides a large MapReduce job into a number of independent small sub-jobs, and iteratively processes one sub-job at a time with efficient use of resources. We also enhance the fault tolerance strategy for multicore platforms.
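The iterative sub-job execution described above can be sketched as follows. This is a minimal single-threaded illustration of the idea, not the Ostrich implementation; the function names (`tiled_mapreduce`, `wc_map`, `wc_reduce`) and the fixed `tile_size` parameter are assumptions made for the example.

```python
from collections import defaultdict

def tiled_mapreduce(records, map_fn, reduce_fn, tile_size):
    """Process a large job as a series of small sub-jobs ("tiles").

    Only one tile's intermediate data is live at a time; each tile's
    reduced partial result is folded into the running combined result,
    so the memory footprint is proportional to the tile size rather
    than to the whole input.
    """
    combined = {}
    for start in range(0, len(records), tile_size):
        tile = records[start:start + tile_size]       # sub-job input
        intermediate = defaultdict(list)
        for rec in tile:                              # map phase of the sub-job
            for key, value in map_fn(rec):
                intermediate[key].append(value)
        for key, values in intermediate.items():      # reduce phase of the sub-job
            partial = reduce_fn(key, values)
            if key in combined:                       # merge into the running result
                combined[key] = reduce_fn(key, [combined[key], partial])
            else:
                combined[key] = partial
    return combined

# Word count expressed in the model (illustrative map/reduce functions).
def wc_map(line):
    return [(word, 1) for word in line.split()]

def wc_reduce(key, values):
    return sum(values)

result = tiled_mapreduce(["a b a", "b c"], wc_map, wc_reduce, tile_size=1)
print(result)  # {'a': 2, 'b': 2, 'c': 1}
```

Because the merge step is itself a reduce, the final result is independent of the tile size; only the peak size of the intermediate data changes.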
     ·Tiled-MapReduce also enables three optimizations otherwise impossible. First, we use dynamic loading and input-data buffer reuse to shorten the life cycle and limit the footprint of input and intermediate data. Second, we use a NUCA/NUMA-aware scheduler to fully exploit the memory hierarchy of a multicore system, resulting in better memory and cache locality. Finally, we use software pipelining and work stealing to eliminate idle time and fully harness the CPU cores.
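The work-stealing part of the last optimization can be sketched generically as below. This is an illustrative toy scheduler using Python threads and per-worker deques, not the Ostrich runtime (under CPython's GIL it demonstrates the scheduling policy rather than real parallel speedup); `run_with_work_stealing` and the round-robin seeding are assumptions for the example.

```python
import threading
from collections import deque
from random import randrange

def run_with_work_stealing(tasks, num_workers=4):
    """Execute independent task callables on several workers.

    Each worker owns a deque and pops from its front; when its deque is
    empty it steals from the back of a randomly chosen victim, so no
    worker sits idle while runnable work remains anywhere.
    """
    deques = [deque() for _ in range(num_workers)]
    locks = [threading.Lock() for _ in range(num_workers)]
    for i, task in enumerate(tasks):                 # round-robin initial distribution
        deques[i % num_workers].append(task)
    results, results_lock = [], threading.Lock()

    def worker(wid):
        while True:
            task = None
            with locks[wid]:
                if deques[wid]:
                    task = deques[wid].popleft()     # take from own queue
            if task is None:                         # own queue empty: try to steal
                victim = randrange(num_workers)
                with locks[victim]:
                    if deques[victim]:
                        task = deques[victim].pop()  # steal from the victim's tail
            if task is None:
                if not any(deques):                  # no work left anywhere: done
                    return
                continue
            out = task()
            with results_lock:
                results.append(out)

    threads = [threading.Thread(target=worker, args=(w,)) for w in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Stealing from the victim's tail while the owner pops from the head is the classic arrangement: it keeps the owner working on its most recently queued (likely cache-warm) items and minimizes contention on the shared ends.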
     ·We implement a prototype of Tiled-MapReduce, named Ostrich, running on an Intel machine with 16 cores. Experiments on four different types of benchmarks show that Ostrich has much better scalability than Phoenix for all benchmarks. Meanwhile, it also saves up to 85% of memory, incurs fewer cache misses and makes more efficient use of CPU cores, resulting in a speedup ranging from 1.2X to 3.3X.
     ·We design and implement two prototypes to show the powerful support of Tiled-MapReduce for different application domains. First, the online aggregation system, named Oops, allows users to observe the progress and approximate results of their tasks. Second, the incremental computation system, named OstrichInc, reuses task results at the sub-job level, which supports both fully incremental and partially incremental computation.
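Reusing results at the sub-job level can be sketched as follows: each tile's partial result is cached under a fingerprint of the tile's input, so a re-run only recomputes tiles whose content changed. This is a minimal illustration of the idea, not the OstrichInc implementation; `incremental_mapreduce`, `tile_key`, and the SHA-1 fingerprint are assumptions for the example.

```python
import hashlib
from collections import defaultdict

def tile_key(tile):
    """Fingerprint a sub-job's input so unchanged tiles can be recognized."""
    return hashlib.sha1("\x00".join(tile).encode()).hexdigest()

def incremental_mapreduce(records, map_fn, reduce_fn, tile_size, cache):
    """Recompute only the sub-jobs whose input changed since the last run.

    `cache` maps a tile fingerprint to that tile's reduced partial result;
    cached tiles are merged without re-running map or reduce on them.
    """
    combined = {}
    for start in range(0, len(records), tile_size):
        tile = records[start:start + tile_size]
        key = tile_key(tile)
        if key not in cache:                      # new or changed tile: recompute
            intermediate = defaultdict(list)
            for rec in tile:
                for k, v in map_fn(rec):
                    intermediate[k].append(v)
            cache[key] = {k: reduce_fn(k, vs) for k, vs in intermediate.items()}
        for k, v in cache[key].items():           # merge the (possibly cached) partial
            combined[k] = reduce_fn(k, [combined[k], v]) if k in combined else v
    return combined

def wc_map(line):
    return [(w, 1) for w in line.split()]

def wc_reduce(key, values):
    return sum(values)

cache = {}
first = incremental_mapreduce(["a b", "b c"], wc_map, wc_reduce, 1, cache)
# Appending input reuses the two cached sub-jobs and computes only the new one.
second = incremental_mapreduce(["a b", "b c", "c d"], wc_map, wc_reduce, 1, cache)
```

With the appended third record, only one new fingerprint enters the cache; the first two tiles are served from their cached partial results, which is the sense in which computation is reused at the sub-job level.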
