Addressing failures in exascale computing

M Snir, RW Wisniewski, JA Abraham… - … Journal of High …, 2014 - journals.sagepub.com
We present here a report produced by a workshop on 'Addressing failures in exascale
computing' held in Park City, Utah, 4–11 August 2012. The charter of this workshop was to …

[BOOK][B] Fault tolerance techniques for high-performance computing

J Dongarra, T Herault, Y Robert - 2015 - Springer
This chapter provides an introduction to resilience methods. The emphasis is on
checkpointing, the de-facto standard technique for resilience in High Performance …

Post-failure recovery of MPI communication capability: Design and rationale

W Bland, A Bouteiller, T Herault… - … Journal of High …, 2013 - journals.sagepub.com
As supercomputers are entering an era of massive parallelism where the frequency of faults
is increasing, the MPI Standard remains distressingly vague on the consequence of failures …

The SIMNET virtual world architecture

J Calvin, A Dickens, B Gaines… - Proceedings of IEEE …, 1993 - ieeexplore.ieee.org
Many tools and techniques have been developed to address specific aspects of interacting
in a virtual world. Few have been designed with an architecture that allows large numbers of …

A user-level InfiniBand-based file system and checkpoint strategy for burst buffers

K Sato, K Mohror, A Moody, T Gamblin… - 2014 14th IEEE/ACM …, 2014 - ieeexplore.ieee.org
Checkpoint/Restart is an indispensable fault tolerance technique commonly used by high-
performance computing applications that run continuously for hours or days at a time …

Unified model for assessing checkpointing protocols at extreme‐scale

G Bosilca, A Bouteiller, E Brunet… - Concurrency and …, 2014 - Wiley Online Library
In this paper, we present a unified model for several well‐known checkpoint/restart
protocols. The proposed model is generic enough to encompass both extremes of the …

Local rollback for resilient MPI applications with application-level checkpointing and message logging

N Losada, G Bosilca, A Bouteiller, P González… - Future Generation …, 2019 - Elsevier
The resilience approach generally used in high-performance computing (HPC) relies on
coordinated checkpoint/restart, a global rollback of all the processes that are running the …

Systems and methods for fault tolerant communications

R Knight - US Patent 9,424,149, 2016 - Google Patents
Apparatuses, systems and methods are disclosed for tolerating fault in a communications
grid. Specifically, various techniques and systems are provided for detecting a fault or failure …

Efficient synchronization under global EDF scheduling on multiprocessors

UMC Devi, H Leontyev… - … Euromicro Conference on …, 2006 - ieeexplore.ieee.org
We consider coordinating accesses to shared data structures in multiprocessor real-time
systems scheduled under preemptive global EDF. To our knowledge, prior work on global …

HydEE: Failure containment without event logging for large scale send-deterministic MPI applications

A Guermouche, T Ropars, M Snir… - 2012 IEEE 26th …, 2012 - ieeexplore.ieee.org
High performance computing will probably reach exascale in this decade. At this scale,
mean time between failures is expected to be a few hours. Existing fault tolerant protocols for …