A taxonomy of task-based parallel programming technologies for high-performance computing

P Thoman, K Dichev, T Heller, R Iakymchuk… - The Journal of …, 2018 - Springer
Task-based programming models for shared memory—such as Cilk Plus and OpenMP 3—
are well established and documented. However, with the increase in parallel, many-core …

Self-stabilizing iterative solvers

P Sao, R Vuduc - Proceedings of the workshop on latest advances in …, 2013 - dl.acm.org
We show how to use the idea of self-stabilization, which originates in the context of
distributed control, to make fault-tolerant iterative solvers. Generally, a self-stabilizing system …

Containment domains: A scalable, efficient and flexible resilience scheme for exascale systems

J Chung, I Lee, M Sullivan, JH Ryoo… - Scientific …, 2013 - content.iospress.com
This paper describes and evaluates a scalable and efficient resilience scheme based on the
concept of containment domains. Containment domains are a programming construct that …

Resiliency in numerical algorithm design for extreme scale simulations

E Agullo, M Altenbernd, H Anzt… - … Journal of High …, 2022 - journals.sagepub.com
This work is based on the seminar titled 'Resiliency in Numerical Algorithm Design for
Extreme Scale Simulations' held March 1–6, 2020, at Schloss Dagstuhl, that was attended …

Silent error detection in numerical time-stepping schemes

AR Benson, S Schmit… - The International Journal …, 2015 - journals.sagepub.com
Errors due to hardware or low-level software problems, if detected, can be fixed by various
schemes, such as recomputation from a checkpoint. Silent errors are errors in application …

Resilience and fault tolerance in high-performance computing for numerical weather and climate prediction

T Benacchio, L Bonaventura… - … Journal of High …, 2021 - journals.sagepub.com
Progress in numerical weather and climate prediction accuracy greatly depends on the
growth of the available computing power. As the number of cores in top computing facilities …

When is multi-version checkpointing needed?

G Lu, Z Zheng, AA Chien - Proceedings of the 3rd Workshop on Fault …, 2013 - dl.acm.org
The scaling of semiconductor technology and increasing power concerns combined with
system scale make fault management a growing concern in high performance computing …

Quantifying the impact of single bit flips on floating point arithmetic

J Elliott, F Mueller, M Stoyanov, C Webster - 2013 - repository.lib.ncsu.edu
In high-end computing, the collective surface area, smaller fabrication sizes, and increasing
density of components have led to an increase in the number of observed bit flips. Such flips …

Exploiting asynchrony from exact forward recovery for DUE in iterative solvers

L Jaulmes, M Casas, M Moretó, E Ayguadé… - Proceedings of the …, 2015 - dl.acm.org
This paper presents a method to protect iterative solvers from Detected and Uncorrected
Errors (DUE) relying on error detection techniques already available in commodity …

Shrink or substitute: handling process failures in HPC systems using in-situ recovery

RA Ashraf, S Hukerikar… - 2018 26th Euromicro …, 2018 - ieeexplore.ieee.org
Efficient utilization of today's high-performance computing (HPC) systems with complex
software and hardware components requires that the HPC applications are designed to …