Exploring automatic, online failure recovery for scientific applications at extreme scales
Application resilience is a key challenge that must be addressed in order to realize the
exascale vision. Process/node failures, an important class of failures, are typically handled …
Internet of Things (IoT): a survey
Internet of things (IoT) is considered as the next evolution of the Internet. IoT is considered
as a global network of things, having a distinct identity, and these are interconnected via a …
FlipIt: An LLVM based fault injector for HPC
High performance computing (HPC) is increasingly subjected to faulty computations. The
frequency of silent data corruptions (SDCs) in particular is expected to increase in emerging …
Quantifying the impact of single bit flips on floating point arithmetic
In high-end computing, the collective surface area, smaller fabrication sizes, and increasing
density of components have led to an increase in the number of observed bit flips. Such flips …
Fault tolerance for remote memory access programming models
Remote Memory Access (RMA) is an emerging mechanism for programming high-
performance computers and datacenters. However, little work exists on resilience schemes …
SPBC: Leveraging the characteristics of MPI HPC applications for scalable checkpointing
The high failure rate expected for future supercomputers requires the design of new fault
tolerant solutions. Most checkpointing protocols are designed to work with any message …
Unified fault-tolerance framework for hybrid task-parallel message-passing applications
We present a unified fault-tolerance framework for task-parallel message-passing
applications to mitigate transient errors. First, we propose a fault-tolerant message-logging …
A method to represent multiple-output switching functions by using multi-valued decision diagrams
T Sasao, JT Butler - … of 26th IEEE International Symposium on …, 1996 - ieeexplore.ieee.org
Multiple-output switching functions can be simulated by multiple-valued decision diagrams
(MDDs) at a significant reduction in computation time. We analyze the following approaches to …
System-wide trade-off modeling of performance, power, and resilience on petascale systems
While performance remains a major objective in the field of high-performance computing
(HPC), future systems will have to deliver desired performance under both reliability and …
Fault-tolerant finite-element multigrid algorithms with hierarchically compressed asynchronous checkpointing
D Göddeke, M Altenbernd, D Ribbrock - Parallel Computing, 2015 - Elsevier
We analyse novel fault tolerance schemes for data loss in multigrid solvers, which
essentially combine ideas of checkpoint-restart with algorithm-based fault tolerance. To …