Fundamentals of fault-tolerant distributed computing in asynchronous environments

FC Gärtner - ACM Computing Surveys (CSUR), 1999 - dl.acm.org
Fault tolerance in distributed computing is a wide area with a significant body of literature
that is vastly diverse in methodology and terminology. This paper aims at structuring the …

The failure detector abstraction

FC Freiling, R Guerraoui, P Kuznetsov - ACM Computing Surveys …, 2011 - dl.acm.org
A failure detector is a fundamental abstraction in distributed computing. This article surveys
this abstraction through two dimensions. First we study failure detectors as building blocks to …

Failure detection and consensus in the crash-recovery model

MK Aguilera, W Chen, S Toueg - Distributed Computing, 2000 - Springer
We study the problems of failure detection and consensus in asynchronous systems in
which processes may crash and recover, and links may lose messages. We first propose …

Leader-based consensus

A Mostéfaoui, M Raynal - Parallel Processing Letters, 2001 - World Scientific
It is now well recognized that consensus is a fundamental problem one has to solve to
implement reliable applications on top of unreliable asynchronous distributed systems prone …

The generic consensus service

R Guerraoui, A Schiper - IEEE Transactions on Software …, 2001 - ieeexplore.ieee.org
This paper describes a modular approach for the construction of fault-tolerant agreement
protocols. The approach is based on a generic consensus service. Fault-tolerant agreement …

Consensus system for solving conflicts in distributed systems

NT Nguyen - Information Sciences, 2002 - Elsevier
By a data conflict in a distributed system we understand a situation (or a state of the system)
in which the system sites generate and store different versions of data which represent the …

Failure detection and consensus in the crash-recovery model

MK Aguilera, W Chen, S Toueg - … , DISC'98 Andros, Greece, September 24 …, 1998 - Springer
We study the problems of failure detection and consensus in asynchronous systems in
which processes may crash and recover, and links may lose messages. We first propose …

Consensus in asynchronous distributed systems: A concise guided tour

R Guerraoui, M Hurfin, A Mostéfaoui… - Advances in Distributed …, 2000 - Springer
It is now recognized that the Consensus problem is a fundamental problem when one has to
design and implement reliable asynchronous distributed systems. This chapter is on the …

Fault-tolerant total order multicast to asynchronous groups

U Fritzke, P Ingels, A Mostéfaoui… - … IEEE Symposium on …, 1998 - ieeexplore.ieee.org
While Total Order Broadcast (or Atomic Broadcast) primitives have received a lot of attention,
the paper concentrates on Total Order Multicast to Multiple Groups in the context of …

On quiescent reliable communication

MK Aguilera, W Chen, S Toueg - SIAM Journal on Computing, 2000 - SIAM
We study the problem of achieving reliable communication with quiescent algorithms (i.e.,
algorithms that eventually stop sending messages) in asynchronous systems with process …