Fault tolerance in petascale/exascale systems: Current knowledge, challenges and research opportunities
F Cappello - The International Journal of High Performance …, 2009 - journals.sagepub.com
The emergence of petascale systems and the promise of future exascale systems have
reinvigorated the community interest in how to manage failures in such systems and ensure …
[BOOK][B] Fault tolerance techniques for high-performance computing
This chapter provides an introduction to resilience methods. The emphasis is on
checkpointing, the de-facto standard technique for resilience in High Performance …
Toward exascale resilience
Over the past few years resilience has become a major issue for high-performance
computing (HPC) systems, in particular in the perspective of large petascale systems and …
Post-failure recovery of MPI communication capability: Design and rationale
As supercomputers are entering an era of massive parallelism where the frequency of faults
is increasing, the MPI Standard remains distressingly vague on the consequence of failures …
Algorithm-based fault tolerance applied to high performance computing
We present a new approach to fault tolerance for High Performance Computing systems. Our
approach is based on a careful adaptation of the Algorithm-Based Fault Tolerance …
Algorithm-based fault tolerance for dense matrix factorizations
Dense matrix factorizations, such as LU, Cholesky and QR, are widely used for scientific
applications that require solving systems of linear equations, eigenvalues and linear least …
An evaluation of user-level failure mitigation support in MPI
As the scale of computing platforms becomes increasingly extreme, the requirements for
application fault tolerance are increasing as well. Techniques to address this problem by …
Fault tolerance of MPI applications in exascale systems: The ULFM solution
The growth in the number of computational resources used by high-performance computing
(HPC) systems leads to an increase in failure rates. Fault-tolerant techniques will become …
Resilience design patterns: A structured approach to resilience at extreme scale
Reliability is a serious concern for future extreme-scale high-performance computing (HPC)
systems. While the HPC community has developed various resilience solutions, the solution …
Unified model for assessing checkpointing protocols at extreme‐scale
In this paper, we present a unified model for several well‐known checkpoint/restart
protocols. The proposed model is generic enough to encompass both extremes of the …