Processor design for soft errors: Challenges and state of the art

T Li, JA Ambrose, R Ragel… - ACM Computing Surveys …, 2016 - dl.acm.org
Today, soft errors are one of the major design technology challenges at and beyond the
22nm technology nodes. This article introduces the soft error problem from the perspective …

Practical resource management in power-constrained, high performance computing

T Patki, DK Lowenthal, A Sasidharan… - Proceedings of the 24th …, 2015 - dl.acm.org
Power management is one of the key research challenges on the path to exascale.
Supercomputers today are designed to be worst-case power provisioned, leading to two …

Fault tolerance of MPI applications in exascale systems: The ULFM solution

N Losada, P González, MJ Martín, G Bosilca… - Future Generation …, 2020 - Elsevier
The growth in the number of computational resources used by high-performance computing
(HPC) systems leads to an increase in failure rates. Fault-tolerant techniques will become …

Collaboro: a collaborative (meta) modeling tool

JLC Izquierdo, J Cabot - PeerJ Computer Science, 2016 - peerj.com
Motivation: Scientists increasingly rely on intelligent information systems to help them in their
daily tasks, in particular for managing research objects, like publications or datasets. The …

Evaluating and extending user-level fault tolerance in MPI applications

I Laguna, DF Richards, T Gamblin… - … Journal of High …, 2016 - journals.sagepub.com
The user-level failure mitigation (ULFM) interface has been proposed to provide fault-
tolerant semantics in the Message Passing Interface (MPI). Previous work presented …

Strategies for fault-tolerant tightly-coupled HPC workloads running on low-budget spot cloud infrastructures

V Munhoz, M Castro… - 2022 IEEE 34th …, 2022 - ieeexplore.ieee.org
Cloud providers can rent their spare computing capacity at substantial discounts, reclaiming
it whenever there is a more profitable higher-priority request, a business model well known …

Reinit: Evaluating the performance of global-restart recovery methods for MPI fault tolerance

G Georgakoudis, L Guo, I Laguna - International Conference on High …, 2020 - Springer
Scaling supercomputers comes with an increase in failure rates due to the increasing
number of hardware components. In standard practice, applications are made resilient …

Local rollback for resilient MPI applications with application-level checkpointing and message logging

N Losada, G Bosilca, A Bouteiller, P González… - Future Generation …, 2019 - Elsevier
The resilience approach generally used in high-performance computing (HPC) relies on
coordinated checkpoint/restart, a global rollback of all the processes that are running the …

Exascale machines require new programming paradigms and runtimes

G Da Costa, T Fahringer, JAR Gallego… - Supercomputing …, 2015 - superfri.org
Extreme scale parallel computing systems will have tens of thousands of optionally
accelerator-equipped nodes with hundreds of cores each, as well as deep memory …

Autocheck: Automatically identifying variables for checkpointing by data dependency analysis

X Fu, W Zhang, S Meng, X Huang, W Xu… - … Conference for High …, 2024 - ieeexplore.ieee.org
Checkpoint/Restart (C/R) has been widely deployed in numerous HPC systems, Clouds,
and industrial data centers, which are typically operated by system engineers. Nevertheless …