[BOOK][B] Fault tolerance techniques for high-performance computing
This chapter provides an introduction to resilience methods. The emphasis is on
checkpointing, the de-facto standard technique for resilience in High Performance …
A survey on checkpointing strategies: Should we always checkpoint à la Young/Daly?
The Young/Daly formula provides an approximation of the optimal checkpointing
period for a parallel application executing on a supercomputing platform. It was originally …
×pipes Lite: a synthesis oriented design library for networks on chips
The limited scalability of current bus topologies for systems on chips (SoCs) dictates the
adoption of networks on chips (NoCs) as a scalable interconnection scheme. Current SoCs …
Resiliency in numerical algorithm design for extreme scale simulations
This work is based on the seminar titled 'Resiliency in Numerical Algorithm Design for
Extreme Scale Simulations' held March 1–6, 2020, at Schloss Dagstuhl, that was attended …
Improving performance of iterative methods by lossy checkpointing
Iterative methods are commonly used approaches to solve large, sparse linear systems,
which are fundamental operations for many modern scientific simulations. When the large …
New-sum: A novel online ABFT scheme for general iterative methods
Emerging high-performance computing platforms, with large component counts and lower
power margins, are anticipated to be more susceptible to soft errors in both logic circuits and …
Correcting soft errors online in fast Fourier transform
While many algorithm-based fault tolerance (ABFT) schemes have been proposed to detect
soft errors offline in the fast Fourier transform (FFT) after computation finishes, none of the …
FlipTracker: Understanding natural error resilience in HPC applications
As high-performance computing systems scale in size and computational power, the danger
of silent errors, i.e., errors that can bypass hardware detection mechanisms and impact …
Online algorithm-based fault tolerance for Cholesky decomposition on heterogeneous systems with GPUs
Extensive research has been done on developing and optimizing algorithm-based fault
tolerance (ABFT) schemes for systolic arrays and general purpose microprocessors …
Assessing general-purpose algorithms to cope with fail-stop and silent errors
In this article, we combine the traditional checkpointing and rollback recovery strategies with
verification mechanisms to cope with both fail-stop and silent errors. The objective is to …