Online-ABFT: An online algorithm based fault tolerance scheme for soft error detection in iterative methods
Z Chen - ACM SIGPLAN Notices, 2013 - dl.acm.org
Soft errors are one-time events that corrupt the state of a computing system but not its overall
functionality. Large supercomputers are especially susceptible to soft errors because of their …
Resilience design patterns: A structured approach to resilience at extreme scale
Reliability is a serious concern for future extreme-scale high-performance computing (HPC)
systems. While the HPC community has developed various resilience solutions, the solution …
When is multi-version checkpointing needed?
The scaling of semiconductor technology and increasing power concerns combined with
system scale make fault management a growing concern in high performance computing …
Exploiting asynchrony from exact forward recovery for DUE in iterative solvers
This paper presents a method to protect iterative solvers from Detected and Uncorrected
Errors (DUE) relying on error detection techniques already available in commodity …
Versioned distributed arrays for resilience in scientific applications: Global view resilience
Exascale studies project reliability challenges for future high-performance computing (HPC)
systems. We propose the Global View Resilience (GVR) system, a library that enables …
Response of HPC hardware to neutron radiation at the dawn of exascale
Every computation presents a small chance that an unexpected phenomenon ruins or
modifies its output. Computers are prone to errors that, although may be very unlikely, are …
A block-asynchronous relaxation method for graphics processing units
In this paper, we analyze the potential of asynchronous relaxation methods on Graphics
Processing Units (GPUs). We develop asynchronous iteration algorithms in CUDA and …
Resilience design patterns - a structured approach to resilience at extreme scale (version 1.0)
In this document, we develop a structured approach to the management of HPC resilience
based on the concept of resilience-based design patterns. A design pattern is a general …
Fault-tolerant finite-element multigrid algorithms with hierarchically compressed asynchronous checkpointing
D Göddeke, M Altenbernd, D Ribbrock - Parallel Computing, 2015 - Elsevier
We analyse novel fault tolerance schemes for data loss in multigrid solvers, which
essentially combine ideas of checkpoint-restart with algorithm-based fault tolerance. To …
Numerical analysis of fixed point algorithms in the presence of hardware faults
The exponential growth of computational power of the extreme scale machines over the past
few decades has led to a corresponding decrease in reliability and a sharp increase of the …