Toward exascale resilience: 2014 update

F Cappello, A Geist, W Gropp, S Kale, B Kramer… - … and Innovations: an …, 2014 - dl.acm.org
Resilience is a major roadblock for HPC executions on future exascale systems. These
systems will typically gather millions of CPU cores running up to a billion threads …

[Book] Fault tolerance techniques for high-performance computing

J Dongarra, T Herault, Y Robert - 2015 - Springer
This chapter provides an introduction to resilience methods. The emphasis is on
checkpointing, the de-facto standard technique for resilience in High Performance …

An evaluation of user-level failure mitigation support in MPI

W Bland, A Bouteiller, T Herault, J Hursey… - Recent Advances in the …, 2012 - Springer
As the scale of computing platforms becomes increasingly extreme, the requirements for
application fault tolerance are increasing as well. Techniques to address this problem by …

The EH model: Early design space exploration of intermittent processor architectures

J San Miguel, K Ganesan, M Badr, C Xia… - 2018 51st Annual …, 2018 - ieeexplore.ieee.org
Energy-harvesting devices—which operate solely on energy collected from their
environment—have brought forth a new paradigm of intermittent computing. These devices …

Sizing and partitioning strategies for burst-buffers to reduce IO contention

G Aupy, O Beaumont… - 2019 IEEE international …, 2019 - ieeexplore.ieee.org
Burst-Buffers are high throughput and small size storage which are being used as an
intermediate storage between the PFS (Parallel File System) and the computational nodes …

Towards optimal multi-level checkpointing

A Benoit, A Cavelan, V Le Fèvre… - IEEE Transactions on …, 2016 - ieeexplore.ieee.org
We provide a framework to analyze multi-level checkpointing protocols, by formally defining
a k-level checkpointing pattern. We provide a first-order approximation to the optimal …

An evaluation of user-level failure mitigation support in MPI

W Bland, A Bouteiller, T Herault, J Hursey, G Bosilca… - Computing, 2013 - Springer
As the scale of computing platforms becomes increasingly extreme, the requirements for
application fault tolerance are increasing as well. Techniques to address this problem by …

Resilience for massively parallel multigrid solvers

M Huber, B Gmeiner, U Rüde, B Wohlmuth - SIAM Journal on Scientific …, 2016 - SIAM
Fault tolerant massively parallel multigrid methods for elliptic partial differential equations
are a step towards resilient solvers. Here, we combine domain partitioning with geometric …

Accelerating seismic redatuming using tile low-rank approximations on NEC SX-Aurora TSUBASA

Y Hong, H Ltaief, M Ravasi, L Gatineau, DE Keyes - 2021 - repository.kaust.edu.sa
With the aim of imaging subsurface discontinuities, seismic data recorded at the surface of
the Earth must be numerically re-positioned at locations in the subsurface where reflections …

Unified fault-tolerance framework for hybrid task-parallel message-passing applications

O Subasi, T Martsinkevich… - … Journal of High …, 2018 - journals.sagepub.com
We present a unified fault-tolerance framework for task-parallel message-passing
applications to mitigate transient errors. First, we propose a fault-tolerant message-logging …