Effective Barrier Synchronization on Intel Xeon Phi Coprocessor


Keywords: Barrier synchronization; Scalability; Algorithms; Many-core architectures; Intel Xeon Phi
Journal: Lecture Notes in Computer Science
Year: 2015
Volume: 9233
Issue: 1
Pages: 588-600
Authors and affiliations: Andrey Rodchenko (16)
Andy Nisbet (16)
Antoniu Pop (16)
Mikel Luján (16)

16. School of Computer Science, The University of Manchester, Manchester, UK
Book title: Euro-Par 2015: Parallel Processing
ISBN: 978-3-662-48096-0
Publication category: Computer Science
Publication subjects: Artificial Intelligence and Robotics
Computer Communication Networks
Software Engineering
Data Encryption
Database Management
Computation by Abstract Devices
Algorithm Analysis and Problem Complexity
Publisher: Springer Berlin / Heidelberg
ISSN: 1611-3349

Abstract

Barriers are a fundamental synchronization primitive, underpinning the parallel execution models of many modern shared-memory parallel programming languages such as OpenMP, OpenCL or Cilk, and are one of the main challenges to scaling. State-of-the-art barrier synchronization algorithms differ in their tradeoffs between critical path length, communication traffic patterns and memory footprint. In this paper, we evaluate the efficiency of five such algorithms on the Intel Xeon Phi coprocessor. In addition, we present a novel hybrid barrier implementation that exploits the topology, the memory hierarchy and the streaming stores of the Xeon Phi architecture to achieve a 3\(\times\) lower overhead than the Intel OpenMP barrier implementation (ICC 14.0.0), thus outperforming, to the best of our knowledge, all other implementations. We evaluate it on the CG and MG kernels from the NAS Parallel Benchmarks, the direct N-body simulation kernel and the EPCC barrier OpenMP microbenchmark. The optimized barriers presented in the paper are available at https://github.com/arodchen/cbarriers and are released as free software.
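
For readers unfamiliar with the primitive, the following is a minimal sketch of a generic centralized sense-reversing barrier, written in Python for readability. It is not taken from the paper or from the cbarriers repository, and it is not claimed to be one of the five algorithms evaluated; the names `SenseReversingBarrier` and `run_phases` are hypothetical, and a lock stands in for the atomic decrement a native implementation would use.

```python
import threading
import time


class SenseReversingBarrier:
    """Illustrative centralized sense-reversing barrier (hypothetical name)."""

    def __init__(self, n_threads):
        self.n_threads = n_threads
        self.count = n_threads          # threads still to arrive in this episode
        self.sense = False              # shared flag, flipped once per episode
        self.lock = threading.Lock()    # stands in for an atomic decrement
        self.local = threading.local()  # holds each thread's private sense

    def wait(self):
        # Each thread flips its private sense on every barrier episode.
        my_sense = not getattr(self.local, "sense", False)
        self.local.sense = my_sense
        with self.lock:
            self.count -= 1
            last = self.count == 0
            if last:
                # The last arriver resets the counter and releases the others.
                self.count = self.n_threads
                self.sense = my_sense
        if not last:
            # Spin on the shared flag; sleep(0) yields to other threads.
            while self.sense != my_sense:
                time.sleep(0)


def run_phases(n_threads=4, n_phases=3):
    """Run workers that record their phase number, then wait at the barrier."""
    barrier = SenseReversingBarrier(n_threads)
    log = []
    log_lock = threading.Lock()

    def worker():
        for phase in range(n_phases):
            with log_lock:
                log.append(phase)
            barrier.wait()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

Because no thread can start phase k+1 until every thread has recorded phase k, the log returned by `run_phases` is always in non-decreasing order. Every thread updates the same counter and spins on the same flag, which is the kind of communication traffic pattern that the abstract names as one of the tradeoffs between barrier algorithms.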
