Addressing failures in exascale computing
We present here a report produced by a workshop on 'Addressing failures in exascale
computing' held in Park City, Utah, 4–11 August 2012. The charter of this workshop was to …
[BOOK][B] Fault tolerance techniques for high-performance computing
This chapter provides an introduction to resilience methods. The emphasis is on
checkpointing, the de-facto standard technique for resilience in High Performance …
Reliability in grid computing systems
C Dabrowski - Concurrency and Computation: Practice and …, 2009 - Wiley Online Library
In recent years, grid technology has emerged as an important tool for solving compute‐
intensive problems within the scientific community and in industry. To further the …
Cloak and dagger: from two permissions to complete control of the UI feedback loop
The effectiveness of the Android permission system fundamentally hinges on the user's
correct understanding of the capabilities of the permissions being granted. In this paper, we …
Uncoordinated checkpointing without domino effect for send-deterministic MPI applications
As reported by many recent studies, the mean time between failures of future post-petascale
supercomputers is likely to reduce, compared to the current situation. The most popular fault …
Exploring automatic, online failure recovery for scientific applications at extreme scales
Application resilience is a key challenge that must be addressed in order to realize the
exascale vision. Process/node failures, an important class of failures, are typically handled …
Bi-objective scheduling algorithms for optimizing makespan and reliability on heterogeneous systems
We tackle the problem of scheduling task graphs onto a heterogeneous set of machines,
where each processor has a probability of failure governed by an exponential law. The goal …
Fault tolerance of MPI applications in exascale systems: The ULFM solution
The growth in the number of computational resources used by high-performance computing
(HPC) systems leads to an increase in failure rates. Fault-tolerant techniques will become …
The reliability wall for exascale supercomputing
X Yang, Z Wang, J Xue, Y Zhou - IEEE Transactions on …, 2011 - ieeexplore.ieee.org
Reliability is a key challenge to be understood to turn the vision of exascale supercomputing
into reality. Inevitably, large-scale supercomputing systems, especially those at the …
The case for modular redundancy in large-scale high performance computing systems
C Engelmann, HH Ong… - Proceedings of the 8th …, 2009 - christian-engelmann.de
Recent investigations into resilience of large-scale high-performance computing (HPC)
systems showed a continuous trend of decreasing reliability and availability. Newly installed …