A systematic survey on fault-tolerant solutions for distributed data analytics: Taxonomy, comparison, and future directions

S Isukapalli, SN Srirama - Computer Science Review, 2024 - Elsevier
Fault tolerance is becoming increasingly important for upcoming exascale systems,
supporting distributed data processing, due to the expected decrease in the Mean Time …

A utilization model for optimization of checkpoint intervals in distributed stream processing systems

S Jayasekara, A Harwood, S Karunasekera - Future Generation Computer …, 2020 - Elsevier
State-of-the-art distributed stream processing systems such as Apache Flink and Storm have
recently included checkpointing to provide fault-tolerance for stateful applications. This is a …

Research on optimal checkpointing-interval for flink stream processing applications

Z Zhang, W Li, X Qing, X Liu, H Liu - Mobile Networks and Applications, 2021 - Springer
Nowadays various distributed stream processing systems (DSPSs) are employed to process
the ever-expanding real-time data. The DSPSs are highly susceptible to system failure, and …

Failure recovery model in big data using the checkpoint approach

S Chorey, N Sahu - Journal of Integrated Science and …, 2023 - pubs.thesciencein.org
Distributed Stream Processing systems are becoming an increasingly crucial aspect of Big
Data processing platforms as customers grow ever more reliant on their capacity to deliver …

Dynamic Adaptive Checkpoint Mechanism for Streaming Applications Based on Reinforcement Learning

Z Zhang, T Liu, Y Shu, S Chen… - 2022 IEEE 28th …, 2023 - ieeexplore.ieee.org
For a stream processing system that uses checkpoints as a fault-tolerant method, selecting
the appropriate checkpoint period is the key to ensuring the efficient operation of streaming …

TranLogs: Lossless Failure Recovery Empowered by Training Logs

X Liu, L Zeng - … on Networking, Architecture and Storage (NAS), 2024 - ieeexplore.ieee.org
When running deep learning training jobs, in order to prevent training loss due to
software/hardware failures, a checkpointing mechanism is usually used to periodically store …

Multi-stage distributed computing for big data: Evaluating connective topologies

RS Gargees, GJ Scott - 2020 10th Annual Computing and …, 2020 - ieeexplore.ieee.org
With the increase in computation and data intensive needs along with the real-time
requirements in the big data era, a distributed framework that can handle parallel processing …

A model of checkpoint behavior for applications that have I/O

B León, S Méndez, D Franco, D Rexachs… - The Journal of …, 2022 - Springer
Due to the increase and complexity of computer systems, reducing the overhead of fault
tolerance techniques has become important in recent years. One technique in fault tolerance …

Work-In-Progress: Fault Tolerance in a Two-State Checkpointing Regularity-Based System

E Torre, AMK Cheng - 2020 IEEE Real-Time Systems …, 2020 - ieeexplore.ieee.org
Real-time embedded systems with safety-critical functions must often share a limited number
of computational resources. Scheduling models within the Hierarchical Real-Time …

FATM: A failure‐aware adaptive fault tolerance model for distributed stream processing systems

SMA Akber, H Chen, H Jin - Concurrency and Computation …, 2021 - Wiley Online Library
Distributed Stream Processing Systems (DSPS) are very popular to process
unbounded data streams in real‐time. Low processing latency is a fundamental requirement …