A taxonomy of task-based parallel programming technologies for high-performance computing
Task-based programming models for shared memory—such as Cilk Plus and OpenMP 3—
are well established and documented. However, with the increase in parallel, many-core …
Self-stabilizing iterative solvers
We show how to use the idea of self-stabilization, which originates in the context of
distributed control, to make fault-tolerant iterative solvers. Generally, a self-stabilizing system …
Containment domains: A scalable, efficient and flexible resilience scheme for exascale systems
This paper describes and evaluates a scalable and efficient resilience scheme based on the
concept of containment domains. Containment domains are a programming construct that …
Resiliency in numerical algorithm design for extreme scale simulations
This work is based on the seminar titled 'Resiliency in Numerical Algorithm Design for
Extreme Scale Simulations' held March 1–6, 2020, at Schloss Dagstuhl, that was attended …
Silent error detection in numerical time-stepping schemes
Errors due to hardware or low-level software problems, if detected, can be fixed by various
schemes, such as recomputation from a checkpoint. Silent errors are errors in application …
Resilience and fault tolerance in high-performance computing for numerical weather and climate prediction
Progress in numerical weather and climate prediction accuracy greatly depends on the
growth of the available computing power. As the number of cores in top computing facilities …
When is multi-version checkpointing needed?
The scaling of semiconductor technology and increasing power concerns combined with
system scale make fault management a growing concern in high performance computing …
Quantifying the impact of single bit flips on floating point arithmetic
In high-end computing, the collective surface area, smaller fabrication sizes, and increasing
density of components have led to an increase in the number of observed bit flips. Such flips …
Exploiting asynchrony from exact forward recovery for DUE in iterative solvers
This paper presents a method to protect iterative solvers from Detected and Uncorrected
Errors (DUE) relying on error detection techniques already available in commodity …
Shrink or substitute: handling process failures in HPC systems using in-situ recovery
Efficient utilization of today's high-performance computing (HPC) systems with complex
software and hardware components requires that the HPC applications are designed to …