[HTML] Toward exascale resilience: 2014 update
Resilience is a major roadblock for HPC executions on future exascale systems. These
systems will typically gather millions of CPU cores running up to a billion threads …
[Book] Fault tolerance techniques for high-performance computing
This chapter provides an introduction to resilience methods. The emphasis is on
checkpointing, the de-facto standard technique for resilience in High Performance …
An evaluation of user-level failure mitigation support in MPI
As the scale of computing platforms becomes increasingly extreme, the requirements for
application fault tolerance are increasing as well. Techniques to address this problem by …
The EH model: Early design space exploration of intermittent processor architectures
Energy-harvesting devices—which operate solely on energy collected from their
environment—have brought forth a new paradigm of intermittent computing. These devices …
Sizing and partitioning strategies for burst-buffers to reduce IO contention
Burst-Buffers are high throughput and small size storage which are being used as an
intermediate storage between the PFS (Parallel File System) and the computational nodes …
Towards optimal multi-level checkpointing
We provide a framework to analyze multi-level checkpointing protocols, by formally defining
a k-level checkpointing pattern. We provide a first-order approximation to the optimal …
Resilience for massively parallel multigrid solvers
Fault tolerant massively parallel multigrid methods for elliptic partial differential equations
are a step towards resilient solvers. Here, we combine domain partitioning with geometric …
Accelerating seismic redatuming using tile low-rank approximations on NEC SX-Aurora TSUBASA
With the aim of imaging subsurface discontinuities, seismic data recorded at the surface of
the Earth must be numerically re-positioned at locations in the subsurface where reflections …
Unified fault-tolerance framework for hybrid task-parallel message-passing applications
We present a unified fault-tolerance framework for task-parallel message-passing
applications to mitigate transient errors. First, we propose a fault-tolerant message-logging …