Toward exascale resilience: 2014 update

F Cappello, G Al, W Gropp, S Kale, B Kramer… - … and Innovations: an …, 2014 - dl.acm.org
Resilience is a major roadblock for HPC executions on future exascale systems. These
systems will typically gather millions of CPU cores running up to a billion threads …

The landscape of exascale research: A data-driven literature analysis

S Heldens, P Hijma, BV Werkhoven… - ACM Computing …, 2020 - dl.acm.org
The next generation of supercomputers will break the exascale barrier. Soon we will have
systems capable of at least one quintillion (billion billion) floating-point operations per …

Distributed data set storage and retrieval

BP Bowman, SE Krueger, RT Knight, CW Ho - US Patent 9,619,148, 2017 - Google Patents
An apparatus includes a processor component caused to: retrieve metadata of organization of
data within a data set, and map data of organization of data blocks within a data file; receive …

Processor design for soft errors: Challenges and state of the art

T Li, JA Ambrose, R Ragel… - ACM Computing Surveys …, 2016 - dl.acm.org
Today, soft errors are one of the major design technology challenges at and beyond the
22nm technology nodes. This article introduces the soft error problem from the perspective …

Accelerating seismic redatuming using tile low-rank approximations on NEC SX-Aurora TSUBASA

Y Hong, H Ltaief, M Ravasi, L Gatineau, DE Keyes - 2021 - repository.kaust.edu.sa
With the aim of imaging subsurface discontinuities, seismic data recorded at the surface of
the Earth must be numerically re-positioned at locations in the subsurface where reflections …

Design and evaluation of FA-MPI, a transactional resilience scheme for non-blocking MPI

A Hassani, A Skjellum… - 2014 44th Annual IEEE …, 2014 - ieeexplore.ieee.org
With the rapid scale out of supercomputers comes a corresponding higher failure frequency.
Fault-tolerant methods have evolved to adapt to high rates of failure, but the behavior of MPI …

Complex scientific applications made fault-tolerant with the sparse grid combination technique

MM Ali, PE Strazdins, B Harding… - … International Journal of …, 2016 - journals.sagepub.com
Ultra-large–scale simulations via solving partial differential equations (PDEs) require very
large computational systems for their timely solution. Studies have shown the rate of failure grows …

A malleable and fault-tolerant task pool framework for X10

M Bungart, C Fohry - 2017 IEEE International Conference on …, 2017 - ieeexplore.ieee.org
Current HPC environments require parallel programs that are both malleable and fault-
tolerant. Malleability denotes the ability to embrace system-initiated resource changes, and …

MPI windows on storage for HPC applications

S Rivas-Gomez, R Gioiosa, IB Peng, G Kestor… - Proceedings of the 24th …, 2017 - dl.acm.org
Upcoming HPC clusters will feature hybrid memories and storage devices per compute
node. In this work, we propose to use the MPI one-sided communication model and MPI …

NR-MPI: a Non-stop and Fault Resilient MPI

G Suo, Y Lu, X Liao, M Xie… - … Conference on Parallel …, 2013 - ieeexplore.ieee.org
Fault resilience has become a major issue for HPC systems, in particular in the perspective
of future E-scale systems, which will consist of millions of CPU cores and other components …