Online-ABFT: An online algorithm based fault tolerance scheme for soft error detection in iterative methods

Z Chen - ACM SIGPLAN Notices, 2013 - dl.acm.org
Soft errors are one-time events that corrupt the state of a computing system but not its overall
functionality. Large supercomputers are especially susceptible to soft errors because of their …

Resilience design patterns: A structured approach to resilience at extreme scale

S Hukerikar, C Engelmann - arXiv preprint arXiv:1708.07422, 2017 - arxiv.org
Reliability is a serious concern for future extreme-scale high-performance computing (HPC)
systems. While the HPC community has developed various resilience solutions, the solution …

When is multi-version checkpointing needed?

G Lu, Z Zheng, AA Chien - Proceedings of the 3rd Workshop on Fault …, 2013 - dl.acm.org
The scaling of semiconductor technology and increasing power concerns combined with
system scale make fault management a growing concern in high performance computing …

Exploiting asynchrony from exact forward recovery for DUE in iterative solvers

L Jaulmes, M Casas, M Moretó, E Ayguadé… - Proceedings of the …, 2015 - dl.acm.org
This paper presents a method to protect iterative solvers from Detected and Uncorrected
Errors (DUE) relying on error detection techniques already available in commodity …

Versioned distributed arrays for resilience in scientific applications: Global view resilience

A Chien, P Balaji, P Beckman, N Dun, A Fang… - Procedia Computer …, 2015 - Elsevier
Exascale studies project reliability challenges for future high-performance computing (HPC)
systems. We propose the Global View Resilience (GVR) system, a library that enables …

Response of HPC hardware to neutron radiation at the dawn of exascale

A Bustos, AJ Rubio-Montero, R Méndez… - The Journal of …, 2023 - Springer
Every computation presents a small chance that an unexpected phenomenon ruins or
modifies its output. Computers are prone to errors that, although may be very unlikely, are …

A block-asynchronous relaxation method for graphics processing units

H Anzt, S Tomov, J Dongarra, V Heuveline - Journal of Parallel and …, 2013 - Elsevier
In this paper, we analyze the potential of asynchronous relaxation methods on Graphics
Processing Units (GPUs). We develop asynchronous iteration algorithms in CUDA and …

Resilience design patterns: A structured approach to resilience at extreme scale (version 1.0)

S Hukerikar, C Engelmann - arXiv preprint arXiv:1611.02717, 2016 - arxiv.org
In this document, we develop a structured approach to the management of HPC resilience
based on the concept of resilience-based design patterns. A design pattern is a general …

Fault-tolerant finite-element multigrid algorithms with hierarchically compressed asynchronous checkpointing

D Göddeke, M Altenbernd, D Ribbrock - Parallel Computing, 2015 - Elsevier
We analyse novel fault tolerance schemes for data loss in multigrid solvers, which
essentially combine ideas of checkpoint-restart with algorithm-based fault tolerance. To …

Numerical analysis of fixed point algorithms in the presence of hardware faults

M Stoyanov, C Webster - SIAM Journal on Scientific Computing, 2015 - SIAM
The exponential growth of computational power of the extreme scale machines over the past
few decades has led to a corresponding decrease in reliability and a sharp increase of the …