Processor design for soft errors: Challenges and state of the art
Today, soft errors are one of the major design technology challenges at and beyond the
22nm technology nodes. This article introduces the soft error problem from the perspective …
Practical resource management in power-constrained, high performance computing
Power management is one of the key research challenges on the path to exascale.
Supercomputers today are designed to be worst-case power provisioned, leading to two …
Fault tolerance of MPI applications in exascale systems: The ULFM solution
The growth in the number of computational resources used by high-performance computing
(HPC) systems leads to an increase in failure rates. Fault-tolerant techniques will become …
Collaboro: a collaborative (meta) modeling tool
Motivation Scientists increasingly rely on intelligent information systems to help them in their
daily tasks, in particular for managing research objects, like publications or datasets. The …
Evaluating and extending user-level fault tolerance in MPI applications
The user-level failure mitigation (ULFM) interface has been proposed to provide fault-
tolerant semantics in the Message Passing Interface (MPI). Previous work presented …
Strategies for fault-tolerant tightly-coupled hpc workloads running on low-budget spot cloud infrastructures
V Munhoz, M Castro… - 2022 IEEE 34th …, 2022 - ieeexplore.ieee.org
Cloud providers can rent their spare computing capacity at substantial discounts, reclaiming
it whenever there is a more profitable higher-priority request, a business model well known …
Reinit: Evaluating the performance of global-restart recovery methods for mpi fault tolerance
Scaling supercomputers comes with an increase in failure rates due to the increasing
number of hardware components. In standard practice, applications are made resilient …
Local rollback for resilient MPI applications with application-level checkpointing and message logging
The resilience approach generally used in high-performance computing (HPC) relies on
coordinated checkpoint/restart, a global rollback of all the processes that are running the …
Exascale machines require new programming paradigms and runtimes
Extreme scale parallel computing systems will have tens of thousands of optionally
accelerator-equipped nodes with hundreds of cores each, as well as deep memory …
Autocheck: Automatically identifying variables for checkpointing by data dependency analysis
X Fu, W Zhang, S Meng, X Huang, W Xu… - … Conference for High …, 2024 - ieeexplore.ieee.org
Checkpoint/Restart (C/R) has been widely deployed in numerous HPC systems, Clouds,
and industrial data centers, which are typically operated by system engineers. Nevertheless …