A survey of rollback-recovery protocols in message-passing systems
This survey covers rollback-recovery techniques that do not require special language
constructs. In the first part of the survey we classify rollback-recovery protocols into …
MPICH-V: Toward a scalable fault tolerant MPI for volatile nodes
Global Computing platforms, large scale clusters and future TeraGRID systems gather
thousands of nodes for computing parallel scientific applications. At this scale, node failures …
Automated application-level checkpointing of MPI programs
The running times of many computational science applications, such as protein-folding
using ab initio methods, are much longer than the mean-time-to-failure of high-performance …
The design and implementation of checkpoint/restart process fault tolerance for Open MPI
To be able to fully exploit ever larger computing platforms, modern HPC applications and
system software must be able to tolerate inevitable faults. Historically, MPI implementations …
Fault tolerance in message passing interface programs
In this paper we examine the topic of writing fault-tolerant Message Passing Interface (MPI)
applications. We discuss the meaning of fault tolerance in general and what the MPI …
MPICH-V2: a fault tolerant MPI for volatile nodes based on pessimistic sender based message logging
Execution of MPI applications on clusters and Grid deployments suffering from node and
network failures motivates the use of fault tolerant MPI implementations. We present MPICH …
MPICH-V project: A multiprotocol automatic fault-tolerant MPI
High performance computing platforms such as Clusters, Grid and Desktop Grids are
becoming larger and subject to more frequent failures. MPI is one of the most used message …
An analysis of communication induced checkpointing
Communication induced checkpointing (CIC) allows processes in a distributed computation
to take independent checkpoints and to avoid the domino effect. This paper presents an …
Coordinated checkpoint versus message log for fault tolerant MPI
Bouteiller, Lemarinier, Krawezik… - 2003 Proceedings IEEE …, 2003 - ieeexplore.ieee.org
MPI is one of the most adopted programming models for large clusters and grid
deployments. However, these systems often suffer from network or node failures. This raises …
Redesigning the message logging model for high performance
Over the past decade the number of processors used in high performance computing has
increased to hundreds of thousands. As a direct consequence, and while the computational …