Resilient MPI applications using an application-level checkpointing framework and ULFM
  • Authors: Nuria Losada; Iván Cores; María J. Martín…
  • Keywords: Resilience; Checkpointing; Fault Tolerance; MPI
  • Journal: The Journal of Supercomputing
  • Publication year: 2017
  • Publication date: January 2017
  • Volume: 73
  • Issue: 1
  • Pages: 100-113
  • Journal category: Computer Science
  • Journal subjects: Programming Languages, Compilers, Interpreters; Processor Architectures; Computer Science, general
  • Publisher: Springer US
  • ISSN: 1573-0484
Abstract
Future exascale systems, built from millions of cores, will exhibit high failure rates, and long-running applications will need new fault tolerance techniques to run to completion. The Fault Tolerance Working Group of the MPI Forum has presented the User Level Failure Mitigation (ULFM) proposal, which provides new functionality for implementing resilient MPI applications. In this work, the CPPC checkpointing framework is extended to exploit the new ULFM functionality. The proposed solution transparently obtains resilient MPI applications by instrumenting the original application code. In addition, a multithreaded multilevel checkpointing scheme, in which the checkpoint files are saved in different memory levels, improves the scalability of the solution. The experimental evaluation shows low overhead when tolerating failures in one or several MPI processes.