Toward exascale resilience: 2014 update

F Cappello, G Al, W Gropp, S Kale, B Kramer… - … and Innovations: an …, 2014 - dl.acm.org
Resilience is a major roadblock for HPC executions on future exascale systems. These
systems will typically gather millions of CPU cores running up to a billion threads …

The landscape of exascale research: A data-driven literature analysis

S Heldens, P Hijma, BV Werkhoven… - ACM Computing …, 2020 - dl.acm.org
The next generation of supercomputers will break the exascale barrier. Soon we will have
systems capable of at least one quintillion (billion billion) floating-point operations per …

The future of scientific workflows

E Deelman, T Peterka, I Altintas… - … Journal of High …, 2018 - journals.sagepub.com
Today's computational, experimental, and observational sciences rely on computations that
involve many related tasks. The success of a scientific mission often hinges on the computer …

FTI: High performance fault tolerance interface for hybrid systems

L Bautista-Gomez, S Tsuboi, D Komatitsch… - Proceedings of 2011 …, 2011 - dl.acm.org
Large scientific applications deployed on current petascale systems expend a significant
amount of their execution time dumping checkpoint files to remote storage. New fault tolerant …

A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems

IP Egwutuoha, D Levy, B Selic, S Chen - The Journal of Supercomputing, 2013 - Springer
In recent years, High Performance Computing (HPC) systems have been shifting
from expensive massively parallel architectures to clusters of commodity PCs to take …

Post-failure recovery of MPI communication capability: Design and rationale

W Bland, A Bouteiller, T Herault… - … Journal of High …, 2013 - journals.sagepub.com
As supercomputers are entering an era of massive parallelism where the frequency of faults
is increasing, the MPI Standard remains distressingly vague on the consequence of failures …

Evaluating the viability of process replication reliability for exascale systems

K Ferreira, J Stearley, JH Laros III, R Oldfield… - Proceedings of 2011 …, 2011 - dl.acm.org
As high-end computing machines continue to grow in size, issues such as fault tolerance
and reliability limit application scalability. Current techniques to ensure progress across …

A 5D gyrokinetic full-f global semi-Lagrangian code for flux-driven ion turbulence simulations

V Grandgirard, J Abiteboul, J Bigot… - Computer physics …, 2016 - Elsevier
This paper addresses non-linear gyrokinetic simulations of ion temperature gradient (ITG)
turbulence in tokamak plasmas. The electrostatic GYSELA code is one of the few …

Algorithm-based fault tolerance for dense matrix factorizations

P Du, A Bouteiller, G Bosilca, T Herault… - ACM SIGPLAN Notices, 2012 - dl.acm.org
Dense matrix factorizations, such as LU, Cholesky and QR, are widely used for scientific
applications that require solving systems of linear equations, eigenvalues and linear least …

A scalable double in-memory checkpoint and restart scheme towards exascale

G Zheng, X Ni, LV Kalé - IEEE/IFIP International Conference on …, 2012 - ieeexplore.ieee.org
As the size of supercomputers increases, the probability of system failure grows
substantially, posing an increasingly significant challenge for scalability. It is important to …