Toward exascale resilience: 2014 update

F Cappello, A Geist, W Gropp, S Kale, B Kramer… - … and Innovations: an …, 2014 - dl.acm.org
Resilience is a major roadblock for HPC executions on future exascale systems. These
systems will typically gather millions of CPU cores running up to a billion threads …

Post-failure recovery of MPI communication capability: Design and rationale

W Bland, A Bouteiller, T Herault… - … Journal of High …, 2013 - journals.sagepub.com
As supercomputers are entering an era of massive parallelism where the frequency of faults
is increasing, the MPI Standard remains distressingly vague on the consequence of failures …

An evaluation of user-level failure mitigation support in MPI

W Bland, A Bouteiller, T Herault, J Hursey… - Recent Advances in the …, 2012 - Springer
As the scale of computing platforms becomes increasingly extreme, the requirements for
application fault tolerance are increasing as well. Techniques to address this problem by …

Evaluating and extending user-level fault tolerance in MPI applications

I Laguna, DF Richards, T Gamblin… - … Journal of High …, 2016 - journals.sagepub.com
The user-level failure mitigation (ULFM) interface has been proposed to provide fault-
tolerant semantics in the Message Passing Interface (MPI). Previous work presented …

Local rollback for resilient MPI applications with application-level checkpointing and message logging

N Losada, G Bosilca, A Bouteiller, P González… - Future Generation …, 2019 - Elsevier
The resilience approach generally used in high-performance computing (HPC) relies on
coordinated checkpoint/restart, a global rollback of all the processes that are running the …

Parallel processing strategies for big geospatial data

M Werner - Frontiers in Big Data, 2019 - frontiersin.org
This paper provides an abstract analysis of parallel processing strategies for spatial and
spatio-temporal data. It isolates aspects such as data locality and computational locality as …

Evaluating user-level fault tolerance for MPI applications

I Laguna, DF Richards, T Gamblin, M Schulz… - Proceedings of the 21st …, 2014 - dl.acm.org
The User Level Failure Mitigation (ULFM) interface has been proposed to provide fault-
tolerant semantics in MPI. Previous work has presented performance evaluations of the …

Twister2: Design of a big data toolkit

S Kamburugamuve, K Govindarajan… - Concurrency and …, 2020 - Wiley Online Library
Data‐driven applications are essential to handle the ever‐increasing volume, velocity, and
veracity of data generated by sources such as the Web and Internet of Things (IoT) devices …

An evaluation of user-level failure mitigation support in MPI

W Bland, A Bouteiller, T Herault, J Hursey, G Bosilca… - Computing, 2013 - Springer
As the scale of computing platforms becomes increasingly extreme, the requirements for
application fault tolerance are increasing as well. Techniques to address this problem by …

Accelerating seismic redatuming using tile low-rank approximations on NEC SX-Aurora TSUBASA

Y Hong, H Ltaief, M Ravasi, L Gatineau, DE Keyes - 2021 - repository.kaust.edu.sa
With the aim of imaging subsurface discontinuities, seismic data recorded at the surface of
the Earth must be numerically re-positioned at locations in the subsurface where reflections …