A survey of rollback-recovery protocols in message-passing systems

EN Elnozahy, L Alvisi, YM Wang… - ACM Computing Surveys …, 2002 - dl.acm.org
This survey covers rollback-recovery techniques that do not require special language
constructs. In the first part of the survey we classify rollback-recovery protocols into …

MPICH-V: Toward a scalable fault tolerant MPI for volatile nodes

G Bosilca, A Bouteiller, F Cappello… - SC'02: Proceedings …, 2002 - ieeexplore.ieee.org
Global Computing platforms, large scale clusters and future TeraGRID systems gather
thousands of nodes for computing parallel scientific applications. At this scale, node failures …

Automated application-level checkpointing of MPI programs

G Bronevetsky, D Marques, K Pingali… - Proceedings of the ninth …, 2003 - dl.acm.org
The running times of many computational science applications, such as protein-folding
using ab initio methods, are much longer than the mean-time-to-failure of high-performance …

The design and implementation of checkpoint/restart process fault tolerance for Open MPI

J Hursey, JM Squyres, TI Mattox… - 2007 IEEE …, 2007 - ieeexplore.ieee.org
To be able to fully exploit ever larger computing platforms, modern HPC applications and
system software must be able to tolerate inevitable faults. Historically, MPI implementations …

Fault tolerance in message passing interface programs

W Gropp, E Lusk - The International Journal of High …, 2004 - journals.sagepub.com
In this paper we examine the topic of writing fault-tolerant Message Passing Interface (MPI)
applications. We discuss the meaning of fault tolerance in general and what the MPI …

MPICH-V2: a fault tolerant MPI for volatile nodes based on pessimistic sender based message logging

A Bouteiller, F Cappello, T Herault, G Krawezik… - Proceedings of the …, 2003 - dl.acm.org
Execution of MPI applications on clusters and Grid deployments suffering from node and
network failures motivates the use of fault tolerant MPI implementations. We present MPICH …

MPICH-V project: A multiprotocol automatic fault-tolerant MPI

A Bouteiller, T Herault, G Krawezik… - … Journal of High …, 2006 - journals.sagepub.com
High performance computing platforms such as Clusters, Grid and Desktop Grids are
becoming larger and subject to more frequent failures. MPI is one of the most used message …

An analysis of communication induced checkpointing

L Alvisi, E Elnozahy, S Rao, SA Husain… - Digest of Papers …, 1999 - ieeexplore.ieee.org
Communication induced checkpointing (CIC) allows processes in a distributed computation
to take independent checkpoints and to avoid the domino effect. This paper presents an …

Coordinated checkpoint versus message log for fault tolerant MPI

A Bouteiller, P Lemarinier, G Krawezik… - 2003 Proceedings IEEE …, 2003 - ieeexplore.ieee.org
MPI is one of the most adopted programming models for large clusters and grid
deployments. However, these systems often suffer from network or node failures. This raises …

Redesigning the message logging model for high performance

A Bouteiller, G Bosilca… - … and Computation: Practice …, 2010 - Wiley Online Library
Over the past decade the number of processors used in high performance computing has
increased to hundreds of thousands. As a direct consequence, and while the computational …