Igor: Accelerating Byzantine fault tolerance for real-time systems with eager execution

A Loveless, R Dreslinski, B Kasikci… - 2021 IEEE 27th Real …, 2021 - ieeexplore.ieee.org
Critical real-time systems like spacecraft and aircraft commonly use Byzantine fault-tolerant
(BFT) state machine replication (SMR) to mask faulty processors and sensors. Unfortunately …

High performance recovery for parallel state machine replication

OM Mendizabal, FL Dotti… - 2017 IEEE 37th …, 2017 - ieeexplore.ieee.org
State machine replication is a fundamental approach to high availability. Despite the vast
literature on the topic, relatively few studies have considered the issues involved in …

Checkpointing techniques in distributed systems: A synopsis of diverse strategies over the last decades

H Goulart, A Franco, O Mendizabal - … de Testes e Tolerância a Falhas …, 2023 - sol.sbc.org.br
This paper concisely reviews checkpointing techniques in distributed systems, focusing on
various aspects such as coordinated and uncoordinated checkpointing, incremental …

Reducing Persistence Overhead in Parallel State Machine Replication through Time-Phased Partitioned Checkpoint

E Gomes Jr, E Alchieri, F Dotti… - Journal of Internet …, 2024 - journals-sol.sbc.org.br
Dependable systems usually rely on replication to provide resilience and availability.
However, for long-lived systems, replication is not enough since given a sufficient amount of …

Analysis of checkpointing overhead in parallel state machine replication

OM Mendizabal, FL Dotti, F Pedone - Proceedings of the 31st Annual …, 2016 - dl.acm.org
State machine replication (SMR) is a well-established technique for developing fault-tolerant systems. In
part, this is explained by the simplicity of the approach and its strong consistency …

Generic Checkpointing Support for Stream-based State-Machine Replication

L Lawniczak, M Ammon, T Distler - Proceedings of the 10th Workshop on …, 2023 - dl.acm.org
Stream-based replication facilitates the deployment and operation of state-machine
replication protocols by running them as applications on top of data-stream processing …

Overcoming the Performance and Security Challenges of Building Highly-Distributed Fault-Tolerant Embedded Systems

A Loveless - 2023 - deepblue.lib.umich.edu
Over the past few decades, embedded systems, like those in spacecraft and aircraft, have
evolved into complex distributed systems with hundreds of nodes and dozens of network …

The Optimal Checkpoint Interval for the Long-Running Application

Y Zhai, W Li - International Journal of Advanced Pervasive and …, 2017 - igi-global.com
In distributed computing systems, excessive or deficient checkpointing operations would
result in severe performance degradation. To minimize the expected computation execution …

Fast recovery in parallel state machine replication

OM Mendizabal - 2016 - repositorio.pucrs.br
State machine replication is a well-established technique for developing
fault-tolerant systems. In part, this is explained by the simplicity of the approach …

The Checkpoint-Timing for Backward Fault-Tolerant Schemes

M Zhang - … Computer Architecture: 12th Conference, ACA 2018 …, 2018 - Springer
To improve the performance of backward fault-tolerant schemes in long-running
parallel applications, a general checkpoint-timing method was proposed to determine the …