HydEE: Failure containment without event logging for large scale send-deterministic MPI applications

A Guermouche, T Ropars, M Snir… - 2012 IEEE 26th …, 2012 - ieeexplore.ieee.org
High performance computing will probably reach exascale in this decade. At this scale,
mean time between failures is expected to be a few hours. Existing fault tolerant protocols for …

On the use of cluster-based partial message logging to improve fault tolerance for MPI HPC applications

T Ropars, A Guermouche, B Uçar, E Meneses… - Euro-Par 2011 Parallel …, 2011 - Springer
Fault tolerance is becoming a major concern in HPC systems. The two traditional
approaches for message passing applications, coordinated checkpointing and message …

Using simulation to evaluate the performance of resilience strategies at scale

S Levy, B Topp, KB Ferreira, D Arnold, T Hoefler… - … and Simulation: 4th …, 2014 - Springer
Fault-tolerance has been identified as a major challenge for future extreme-scale systems.
Current predictions suggest that, as systems grow in size, failures will occur more frequently …

Scalable group-based checkpoint/restart for large-scale message-passing systems

JCY Ho, CL Wang, FCM Lau - 2008 IEEE International …, 2008 - ieeexplore.ieee.org
The ever increasing number of processors used in parallel computers is making fault
tolerance support in large-scale parallel systems more and more important. We discuss the …

Minimum process coordinated checkpointing scheme for ad hoc networks

R Tuli, P Kumar - arXiv preprint arXiv:1111.2208, 2011 - arxiv.org
The wireless mobile ad hoc network (MANET) architecture is one consisting of a set of
mobile hosts capable of communicating with each other without the assistance of base …

A novel roll-back mechanism for performance enhancement of asynchronous checkpointing and recovery

B Gupta, S Rahmi, Y Yang - Informatica, 2007 - informatica.si
In this paper, we present a high performance recovery algorithm for distributed systems in
which checkpoints are taken asynchronously. It offers fast determination of the recent …

Dynamic load balance for optimized message logging in fault tolerant HPC applications

E Meneses, LV Kalé… - 2011 IEEE International …, 2011 - ieeexplore.ieee.org
Computing systems will grow significantly larger in the near future to satisfy the needs of
computational scientists in areas like climate modeling, biophysics and cosmology …

[BOOK] Scalable message-logging techniques for effective fault tolerance in HPC applications

EM Rojas - 2013 - search.proquest.com
An important set of challenges emerge as the High Performance Computing (HPC)
community aims to reach extreme scale. Resilience and energy consumption are two of …

A recovery scheme for cluster federations using sender-based message logging

B Gupta, R Nikolaev, R Chirra - Journal of computing and information …, 2011 - hrcak.srce.hr
A cluster federation is a union of clusters and is heterogeneous. Each cluster
contains a certain number of processes. An application running in such a computing …

Second-level algorithms, superrecursivity, and recovery problem in distributed systems

M Burgin, B Gupta - Theory of Computing Systems, 2012 - Springer
In this paper, we analyze network recovery algorithms, which allow computer networks to
properly function in spite of failures. In this analysis, we use methods and tools of the theory …