Toward exascale resilience: 2014 update
Resilience is a major roadblock for HPC executions on future exascale systems. These
systems will typically gather millions of CPU cores running up to a billion threads …
The landscape of exascale research: A data-driven literature analysis
The next generation of supercomputers will break the exascale barrier. Soon we will have
systems capable of at least one quintillion (billion billion) floating-point operations per …
Distributed data set storage and retrieval
BP Bowman, SE Krueger, RT Knight, CW Ho - US Patent 9,619,148, 2017 - Google Patents
An apparatus includes a processor component caused to: retrieve metadata of organization of
data within a data set, and map data of organization of data blocks within a data file; receive …
Processor design for soft errors: Challenges and state of the art
Today, soft errors are one of the major design technology challenges at and beyond the
22nm technology nodes. This article introduces the soft error problem from the perspective …
Accelerating seismic redatuming using tile low-rank approximations on NEC SX-Aurora TSUBASA
With the aim of imaging subsurface discontinuities, seismic data recorded at the surface of
the Earth must be numerically re-positioned at locations in the subsurface where reflections …
Design and evaluation of FA-MPI, a transactional resilience scheme for non-blocking MPI
With the rapid scale out of supercomputers comes a corresponding higher failure frequency.
Fault-tolerant methods have evolved to adapt to high rates of failure, but the behavior of MPI …
Complex scientific applications made fault-tolerant with the sparse grid combination technique
Ultra-large–scale simulations via solving partial differential equations (PDEs) require very
large computational systems for their timely solution. Studies have shown the rate of failure grows …
A malleable and fault-tolerant task pool framework for X10
M Bungart, C Fohry - 2017 IEEE International Conference on …, 2017 - ieeexplore.ieee.org
Current HPC environments require parallel programs that are both malleable and fault-
tolerant. Malleability denotes the ability to embrace system-initiated resource changes, and …
MPI windows on storage for HPC applications
Upcoming HPC clusters will feature hybrid memories and storage devices per compute
node. In this work, we propose to use the MPI one-sided communication model and MPI …
NR-MPI: a Non-stop and Fault Resilient MPI
G Suo, Y Lu, X Liao, M Xie… - … Conference on Parallel …, 2013 - ieeexplore.ieee.org
Fault resilience has become a major issue for HPC systems, in particular in the perspective
of future E-scale systems, which will consist of millions of CPU cores and other components …