An efficient protocol for checkpointing recovery in distributed systems

JL Kim, T Park - IEEE Transactions on Parallel and Distributed …, 1993 - ieeexplore.ieee.org
The authors present an efficient synchronized checkpointing protocol that exploits the
dependency relation between processes in distributed systems. In this protocol, a process …

Impact of checkpoint latency on overhead ratio of a checkpointing scheme

NH Vaidya - IEEE Transactions on Computers, 1997 - ieeexplore.ieee.org
Checkpointing reduces loss of computation in the presence of failures. Two metrics
characterize a checkpointing scheme: checkpoint overhead and checkpoint latency. The …

A variational calculus approach to optimal checkpoint placement

Y Ling, J Mi, X Lin - IEEE Transactions on Computers, 2001 - ieeexplore.ieee.org
Checkpointing is an effective fault-tolerant technique for improving system availability and
reliability. However, a blind checkpointing placement can result in either performance …

Analysis of checkpointing for real-time systems

S Punnekkat, A Burns, R Davis - Real-Time Systems, 2001 - Springer
Predictable performance in the event of failures is of paramount importance in most safety
critical real-time systems. Various hardware as well as software fault-tolerant techniques are …

The University of Chicago

ES - Minerva, 1975 - JSTOR
On 20 March, 1974, Professor Edward Banfield, of the University of Pennsylvania, was
prevented from delivering a lecture at the University of Chicago. Professor Banfield, who is a …

CATCH-compiler-assisted techniques for checkpointing

CCJ Li, WK Fuchs - Digest of Papers. Fault-Tolerant Computing: 20th …, 1990 - computer.org
Many real-time applications require one to many (multicast) communication. Real time
applications can gracefully accommodate some loss but require low delay. We minimize the …

[PDF] Predicting Computer System Failures Using Support Vector Machines.

EW Fulp, GA Fink, JN Haack - WASL, 2008 - usenix.org
Mitigating the impact of computer failure is possible if accurate failure predictions are
provided. Resources, applications, and services can be scheduled around predicted failure …

Checkpointing strategies for parallel jobs

M Bougeret, H Casanova, M Rabie, Y Robert… - Proceedings of 2011 …, 2011 - dl.acm.org
This work provides an analysis of checkpointing strategies for minimizing expected job
execution times in an environment that is subject to processor failures. In the case of both …

A case for two-level distributed recovery schemes

NH Vaidya - Proceedings of the 1995 ACM SIGMETRICS joint …, 1995 - dl.acm.org
Most distributed and multiprocessor recovery schemes proposed in the literature are
designed to tolerate arbitrary number of failures. In this paper, we demonstrate that, it is often …

Adaptive task checkpointing and replication: Toward efficient fault-tolerant grids

M Chtepen, FHA Claeys, B Dhoedt… - … on Parallel and …, 2008 - ieeexplore.ieee.org
A grid is a distributed computational and storage environment often composed of
heterogeneous autonomously managed subsystems. As a result, varying resource …