Artem Shinkarov - 2019-max-resilience

Checkpointing Kernel Executions of MPI+CUDA Applications

M. Baird, S.-B. Scholz, A. Šinkarovs, and L. Bautista-Gomez, “Checkpointing kernel executions of MPI+CUDA applications,” in Euro-Par 2019: Parallel Processing Workshops, May 2020, pp. 694–706.


Abstract

This paper proposes a new approach to checkpointing MPI applications that use long-running CUDA kernels. With this approach, snapshots of data residing on the GPUs can be taken without waiting for kernels to complete. The technique is implemented in FTI, a state-of-the-art high-performance fault tolerance library. The result is an elegant solution to the problem of developing resilient MPI applications whose GPU kernels run longer than the mean time between hardware failures. We describe in detail how we checkpoint and restart collaborative MPI+CUDA applications, and we provide an initial evaluation of the approach using the Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics (LULESH) application as a case study.