IOPro: a parallel I/O profiling and visualization framework for high-performance storage systems
  • Authors: Seong Jo Kim (1)
    Yuanrui Zhang (2)
    Seung Woo Son (3)
    Mahmut Kandemir (1)
    Wei-keng Liao (4)
    Rajeev Thakur (5)
    Alok Choudhary (4)

    1. Pennsylvania State University, University Park, PA 16802, USA
    2. Intel Corporation, Santa Clara, CA 95054, USA
    3. University of Massachusetts Lowell, Lowell, MA 01854, USA
    4. Northwestern University, Evanston, IL 60208, USA
    5. Argonne National Laboratory, Argonne, IL 60439, USA
  • Keywords: MPI ; IO ; Parallel file systems ; Parallel NetCDF ; HDF5 ; I/O software stack ; Code instrumentation ; Performance visualization
  • Journal: The Journal of Supercomputing
  • Year: 2015
  • Publication date: March 2015
  • Volume: 71
  • Issue: 3
  • Pages: 840-870
  • Full text size: 3,346 KB
  • Journal category: Computer Science
  • Journal subjects: Programming Languages, Compilers and Interpreters; Processor Architectures; Computer Science, general
  • Publisher: Springer Netherlands
  • ISSN: 1573-0484
Abstract
Efficient execution of large-scale scientific applications requires high-performance computing systems designed to meet their I/O requirements. To achieve high performance, such data-intensive parallel applications use a multi-layer I/O software stack, which consists of high-level I/O libraries such as PnetCDF and HDF5, the MPI library, and parallel file systems. To design efficient parallel scientific applications, it is essential to understand the complicated flow of I/O operations and the interactions among these libraries. Such comprehension helps identify I/O bottlenecks and thus exploit the performance potential of the different layers of the storage hierarchy. To profile the performance of individual components in the I/O stack and to understand the complex interactions among them, we have implemented IOPro, a GUI-based integrated profiling and analysis framework. IOPro automatically generates an instrumented I/O stack, runs applications on it, and visualizes detailed statistics based on user-specified metrics of interest. We present experimental results from two real-life applications and show how our framework can be used in practice. By generating an end-to-end trace of the whole I/O stack and pinpointing I/O interference, IOPro aids in understanding I/O behavior and in significantly improving I/O performance.
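The kind of layer-by-layer instrumentation the abstract describes can be illustrated with the standard MPI profiling interface (PMPI), one common way to intercept MPI-IO calls at the middle layer of the stack. The sketch below is a minimal illustration of that general technique under our own assumptions, not IOPro's actual implementation: it wraps MPI_File_write_all, times the call with MPI_Wtime, and forwards to the real routine through its PMPI_ counterpart.

#include <mpi.h>
#include <stdio.h>

/* Hypothetical wrapper: intercepts MPI_File_write_all.  The MPI
 * standard guarantees every MPI_ routine has a PMPI_ twin, so the
 * wrapper can forward to the real implementation after timing it. */
int MPI_File_write_all(MPI_File fh, const void *buf, int count,
                       MPI_Datatype datatype, MPI_Status *status)
{
    double start = MPI_Wtime();
    int rc = PMPI_File_write_all(fh, buf, count, datatype, status);
    double elapsed = MPI_Wtime() - start;

    int rank;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    /* A real profiler would buffer this record in a trace; printing
     * on every call is only for illustration. */
    fprintf(stderr, "[rank %d] MPI_File_write_all: count=%d, %.6f s\n",
            rank, count, elapsed);
    return rc;
}

Compiled into a library and placed ahead of the MPI library at link time (or preloaded with LD_PRELOAD as a shared object), such a wrapper yields per-rank timings without modifying the application; a framework like IOPro additionally instruments the high-level library and file-system layers and correlates the records into an end-to-end trace.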
