Fault tolerance in petascale/exascale systems: Current knowledge, challenges and research opportunities

F Cappello - The International Journal of High Performance …, 2009 - journals.sagepub.com
The emergence of petascale systems and the promise of future exascale systems have
reinvigorated the community interest in how to manage failures in such systems and ensure …

[Book][B] Fault tolerance techniques for high-performance computing

J Dongarra, T Herault, Y Robert - 2015 - Springer
This chapter provides an introduction to resilience methods. The emphasis is on
checkpointing, the de facto standard technique for resilience in High Performance …

Toward exascale resilience

F Cappello, A Geist, B Gropp, L Kale… - … Journal of High …, 2009 - journals.sagepub.com
Over the past few years resilience has become a major issue for high-performance
computing (HPC) systems, in particular in the perspective of large petascale systems and …

Post-failure recovery of MPI communication capability: Design and rationale

W Bland, A Bouteiller, T Herault… - … Journal of High …, 2013 - journals.sagepub.com
As supercomputers are entering an era of massive parallelism where the frequency of faults
is increasing, the MPI Standard remains distressingly vague on the consequence of failures …

Algorithm-based fault tolerance applied to high performance computing

G Bosilca, R Delmas, J Dongarra, J Langou - Journal of Parallel and …, 2009 - Elsevier
We present a new approach to fault tolerance for High Performance Computing systems. Our
approach is based on a careful adaptation of the Algorithm-Based Fault Tolerance …

Algorithm-based fault tolerance for dense matrix factorizations

P Du, A Bouteiller, G Bosilca, T Herault… - ACM SIGPLAN Notices, 2012 - dl.acm.org
Dense matrix factorizations, such as LU, Cholesky and QR, are widely used for scientific
applications that require solving systems of linear equations, eigenvalues and linear least …

An evaluation of user-level failure mitigation support in MPI

W Bland, A Bouteiller, T Herault, J Hursey… - Recent Advances in the …, 2012 - Springer
As the scale of computing platforms becomes increasingly extreme, the requirements for
application fault tolerance are increasing as well. Techniques to address this problem by …

Fault tolerance of MPI applications in exascale systems: The ULFM solution

N Losada, P González, MJ Martín, G Bosilca… - Future Generation …, 2020 - Elsevier
The growth in the number of computational resources used by high-performance computing
(HPC) systems leads to an increase in failure rates. Fault-tolerant techniques will become …

Resilience design patterns: A structured approach to resilience at extreme scale

S Hukerikar, C Engelmann - arXiv preprint arXiv:1708.07422, 2017 - arxiv.org
Reliability is a serious concern for future extreme-scale high-performance computing (HPC)
systems. While the HPC community has developed various resilience solutions, the solution …

Unified model for assessing checkpointing protocols at extreme‐scale

G Bosilca, A Bouteiller, E Brunet… - Concurrency and …, 2014 - Wiley Online Library
In this paper, we present a unified model for several well‐known checkpoint/restart
protocols. The proposed model is generic enough to encompass both extremes of the …