Resiliency in numerical algorithm design for extreme scale simulations

E Agullo, M Altenbernd, H Anzt… - … Journal of High …, 2022 - journals.sagepub.com
This work is based on the seminar titled 'Resiliency in Numerical Algorithm Design for
Extreme Scale Simulations' held March 1–6, 2020, at Schloss Dagstuhl, that was attended …

Resilience for massively parallel multigrid solvers

M Huber, B Gmeiner, U Rüde, B Wohlmuth - SIAM Journal on Scientific …, 2016 - SIAM
Fault tolerant massively parallel multigrid methods for elliptic partial differential equations
are a step towards resilient solvers. Here, we combine domain partitioning with geometric …

Recent developments in the theory and application of the sparse grid combination technique

M Hegland, B Harding, C Kowitz, D Pflüger… - Software for Exascale …, 2016 - Springer
Substantial modifications of both the choice of the grids, the combination coefficients, the
parallel data structures and the algorithms used for the combination technique lead to …

Complex scientific applications made fault-tolerant with the sparse grid combination technique

MM Ali, PE Strazdins, B Harding… - … International Journal of …, 2016 - journals.sagepub.com
Ultra-large-scale simulations via solving partial differential equations (PDEs) require very
large computational systems for their timely solution. Studies have shown the rate of failure grows …

A highly scalable, algorithm-based fault-tolerant solver for gyrokinetic plasma simulations

M Obersteiner, AP Hinojosa, M Heene… - Proceedings of the 8th …, 2017 - dl.acm.org
With future exascale computers expected to have millions of compute units distributed
among thousands of nodes, system faults are predicted to become more frequent. Fault …

A massively parallel combination technique for the solution of high-dimensional PDEs

M Heene - 2018 - core.ac.uk
The solution of high-dimensional problems, especially high-dimensional partial differential
equations (PDEs) that require the joint discretization of more than the usual three spatial …

Fault-Tolerant Parallel Multigrid Method on Unstructured Adaptive Mesh

F Fung, L Stals, Q Deng - SIAM Journal on Scientific Computing, 2024 - SIAM
As the generation of exascale high-performance clusters begins, it has become evident that
numerical algorithms will greatly benefit from built-in resilience features that can handle …

EXAHD: a massively parallel fault tolerant sparse grid approach for high-dimensional turbulent plasma simulations

R Lago, M Obersteiner, T Pollinger… - Software for Exascale …, 2020 - library.oapen.org
Plasma fusion is one of the promising candidates for an emission-free energy source and is
heavily investigated with high-resolution numerical simulations. Unfortunately, these …

Handling silent data corruption with the sparse grid combination technique

AP Hinojosa, B Harding, M Hegland… - Software for Exascale …, 2016 - Springer
We describe two algorithms to detect and filter silent data corruption (SDC) when solving
time-dependent PDEs with the Sparse Grid Combination Technique (SGCT). The SGCT …

A spatially adaptive and massively parallel implementation of the fault-tolerant combination technique

MJ Obersteiner - 2021 - mediatum.ub.tum.de
In this work, we discuss measures to increase the scalability, robustness, and efficiency of
the Combination Technique. In particular, we introduce an asynchronous variant and …