Addressing failures in exascale computing
We present here a report produced by a workshop on 'Addressing failures in exascale
computing' held in Park City, Utah, 4–11 August 2012. The charter of this workshop was to …
[BOOK][B] Fault tolerance techniques for high-performance computing
This chapter provides an introduction to resilience methods. The emphasis is on
checkpointing, the de-facto standard technique for resilience in High Performance …
Post-failure recovery of MPI communication capability: Design and rationale
As supercomputers are entering an era of massive parallelism where the frequency of faults
is increasing, the MPI Standard remains distressingly vague on the consequence of failures …
The SIMNET virtual world architecture
J Calvin, A Dickens, B Gaines… - Proceedings of IEEE …, 1993 - ieeexplore.ieee.org
Many tools and techniques have been developed to address specific aspects of interacting
in a virtual world. Few have been designed with an architecture that allows large numbers of …
A user-level infiniband-based file system and checkpoint strategy for burst buffers
Checkpoint/Restart is an indispensable fault tolerance technique commonly used by high-
performance computing applications that run continuously for hours or days at a time …
Unified model for assessing checkpointing protocols at extreme‐scale
In this paper, we present a unified model for several well‐known checkpoint/restart
protocols. The proposed model is generic enough to encompass both extremes of the …
Local rollback for resilient MPI applications with application-level checkpointing and message logging
The resilience approach generally used in high-performance computing (HPC) relies on
coordinated checkpoint/restart, a global rollback of all the processes that are running the …
Systems and methods for fault tolerant communications
R Knight - US Patent 9,424,149, 2016 - Google Patents
Apparatuses, systems and methods are disclosed for tolerating fault in a communications
grid. Specifically, various techniques and systems are provided for detecting a fault or failure …
Efficient synchronization under global EDF scheduling on multiprocessors
UMC Devi, H Leontyev… - … Euromicro Conference on …, 2006 - ieeexplore.ieee.org
We consider coordinating accesses to shared data structures in multiprocessor real-time
systems scheduled under preemptive global EDF. To our knowledge, prior work on global …
HydEE: Failure containment without event logging for large scale send-deterministic MPI applications
High performance computing will probably reach exascale in this decade. At this scale,
mean time between failures is expected to be a few hours. Existing fault tolerant protocols for …