Toward exascale resilience: 2014 update
F Cappello, G Al, W Gropp, S Kale, B Kramer… - … and Innovations: an …, 2014 - dl.acm.org
Resilience is a major roadblock for HPC executions on future exascale systems. These
systems will typically gather millions of CPU cores running up to a billion threads …
Predictive reliability and fault management in exascale systems: State of the art and perspectives
Performance and power constraints come together with Complementary Metal Oxide
Semiconductor technology scaling in future Exascale systems. Technology scaling makes …
Evaluating the impact of SDC on the GMRES iterative solver
Increasing parallelism and transistor density, along with increasingly tighter energy and
peak power constraints, may force exposure of occasionally incorrect computation or …
×pipes Lite: a synthesis oriented design library for networks on chips
The limited scalability of current bus topologies for systems on chips (SoCs) dictates the
adoption of networks on chips (NoCs) as a scalable interconnection scheme. Current SoCs …
Improving performance of iterative methods by lossy checkponting
Iterative methods are commonly used approaches to solve large, sparse linear systems,
which are fundamental operations for many modern scientific simulations. When the large …
New-sum: A novel online ABFT scheme for general iterative methods
Emerging high-performance computing platforms, with large component counts and lower
power margins, are anticipated to be more susceptible to soft errors in both logic circuits and …
Correcting soft errors online in fast Fourier transform
While many algorithm-based fault tolerance (ABFT) schemes have been proposed to detect
soft errors offline in the fast Fourier transform (FFT) after computation finishes, none of the …
Correcting for unknown errors in sparse high-dimensional function approximation
We consider sparsity-based techniques for the approximation of high-dimensional functions
from random pointwise evaluations. To date, almost all the works published in this field …
Towards a more complete understanding of SDC propagation
With the rate of errors that can silently affect an application's state/output expected to
increase on future HPC machines, numerous application-level detection and recovery …
Resilience for massively parallel multigrid solvers
Fault tolerant massively parallel multigrid methods for elliptic partial differential equations
are a step towards resilient solvers. Here, we combine domain partitioning with geometric …