Addressing failures in exascale computing

M Snir, RW Wisniewski, JA Abraham… - … Journal of High …, 2014 - journals.sagepub.com
We present here a report produced by a workshop on 'Addressing failures in exascale
computing' held in Park City, Utah, 4–11 August 2012. The charter of this workshop was to …

[BOOK] Fault tolerance techniques for high-performance computing

J Dongarra, T Herault, Y Robert - 2015 - Springer
This chapter provides an introduction to resilience methods. The emphasis is on
checkpointing, the de-facto standard technique for resilience in High Performance …

Reliability in grid computing systems

C Dabrowski - Concurrency and Computation: Practice and …, 2009 - Wiley Online Library
In recent years, grid technology has emerged as an important tool for solving
compute-intensive problems within the scientific community and in industry. To further the …

Cloak and dagger: from two permissions to complete control of the UI feedback loop

Y Fratantonio, C Qian, SP Chung… - 2017 IEEE Symposium …, 2017 - ieeexplore.ieee.org
The effectiveness of the Android permission system fundamentally hinges on the user's
correct understanding of the capabilities of the permissions being granted. In this paper, we …

Uncoordinated checkpointing without domino effect for send-deterministic MPI applications

A Guermouche, T Ropars, E Brunet… - … Parallel & Distributed …, 2011 - ieeexplore.ieee.org
As reported by many recent studies, the mean time between failures of future post-petascale
supercomputers is likely to reduce, compared to the current situation. The most popular fault …

Exploring automatic, online failure recovery for scientific applications at extreme scales

M Gamell, DS Katz, H Kolla, J Chen… - SC'14: Proceedings …, 2014 - ieeexplore.ieee.org
Application resilience is a key challenge that must be addressed in order to realize the
exascale vision. Process/node failures, an important class of failures, are typically handled …

Bi-objective scheduling algorithms for optimizing makespan and reliability on heterogeneous systems

JJ Dongarra, E Jeannot, E Saule, Z Shi - Proceedings of the nineteenth …, 2007 - dl.acm.org
We tackle the problem of scheduling task graphs onto a heterogeneous set of machines,
where each processor has a probability of failure governed by an exponential law. The goal …

Fault tolerance of MPI applications in exascale systems: The ULFM solution

N Losada, P González, MJ Martín, G Bosilca… - Future Generation …, 2020 - Elsevier
The growth in the number of computational resources used by high-performance computing
(HPC) systems leads to an increase in failure rates. Fault-tolerant techniques will become …

The reliability wall for exascale supercomputing

X Yang, Z Wang, J Xue, Y Zhou - IEEE Transactions on …, 2011 - ieeexplore.ieee.org
Reliability is a key challenge to be understood to turn the vision of exascale supercomputing
into reality. Inevitably, large-scale supercomputing systems, especially those at the …

[PDF] The case for modular redundancy in large-scale high performance computing systems

C Engelmann, HH Ong… - Proceedings of the 8th …, 2009 - christian-engelmann.de
Recent investigations into resilience of large-scale high-performance computing (HPC)
systems showed a continuous trend of decreasing reliability and availability. Newly installed …