Toward exascale resilience: 2014 update

F Cappello, G Al, W Gropp, S Kale, B Kramer… - … and Innovations: an …, 2014 - dl.acm.org
Resilience is a major roadblock for HPC executions on future exascale systems. These
systems will typically gather millions of CPU cores running up to a billion threads …

Predictive reliability and fault management in exascale systems: State of the art and perspectives

R Canal, C Hernandez, R Tornero, A Cilardo… - ACM Computing …, 2020 - dl.acm.org
Performance and power constraints come together with Complementary Metal Oxide
Semiconductor technology scaling in future Exascale systems. Technology scaling makes …

Evaluating the impact of SDC on the GMRES iterative solver

J Elliott, M Hoemmen, F Mueller - 2014 ieee 28th international …, 2014 - ieeexplore.ieee.org
Increasing parallelism and transistor density, along with increasingly tighter energy and
peak power constraints, may force exposure of occasionally incorrect computation or …

×pipes Lite: a synthesis oriented design library for networks on chips

S Stergiou, F Angiolini, S Carta, L Raffo… - … Automation and Test …, 2005 - ieeexplore.ieee.org
The limited scalability of current bus topologies for systems on chips (SoCs) dictates the
adoption of networks on chips (NoCs) as a scalable interconnection scheme. Current SoCs …

Improving performance of iterative methods by lossy checkpointing

D Tao, S Di, X Liang, Z Chen, F Cappello - Proceedings of the 27th …, 2018 - dl.acm.org
Iterative methods are commonly used approaches to solve large, sparse linear systems,
which are fundamental operations for many modern scientific simulations. When the large …

New-sum: A novel online abft scheme for general iterative methods

D Tao, SL Song, S Krishnamoorthy, P Wu… - Proceedings of the 25th …, 2016 - dl.acm.org
Emerging high-performance computing platforms, with large component counts and lower
power margins, are anticipated to be more susceptible to soft errors in both logic circuits and …

Correcting soft errors online in fast fourier transform

X Liang, J Chen, D Tao, S Li, P Wu, H Li… - Proceedings of the …, 2017 - dl.acm.org
While many algorithm-based fault tolerance (ABFT) schemes have been proposed to detect
soft errors offline in the fast Fourier transform (FFT) after computation finishes, none of the …

Correcting for unknown errors in sparse high-dimensional function approximation

B Adcock, A Bao, S Brugiapaglia - Numerische Mathematik, 2019 - Springer
We consider sparsity-based techniques for the approximation of high-dimensional functions
from random pointwise evaluations. To date, almost all the works published in this field …

Towards a more complete understanding of SDC propagation

J Calhoun, M Snir, LN Olson, WD Gropp - Proceedings of the 26th …, 2017 - dl.acm.org
With the rate of errors that can silently affect an application's state/output expected to
increase on future HPC machines, numerous application-level detection and recovery …

Resilience for massively parallel multigrid solvers

M Huber, B Gmeiner, U Rüde, B Wohlmuth - SIAM Journal on Scientific …, 2016 - SIAM
Fault tolerant massively parallel multigrid methods for elliptic partial differential equations
are a step towards resilient solvers. Here, we combine domain partitioning with geometric …