A Survey on Failure Analysis and Fault Injection in AI Systems

G Yu, G Tan, H Huang, Z Zhang, P Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid advancement of Artificial Intelligence (AI) has led to its integration into various
areas, especially with Large Language Models (LLMs) significantly enhancing capabilities …

Model checking guided testing for distributed systems

D Wang, W Dou, Y Gao, C Wu, J Wei… - Proceedings of the …, 2023 - dl.acm.org
Distributed systems have become the backbone of cloud computing. Incorrect system
designs and implementations can greatly impair the reliability of distributed systems …

SandTable: Scalable Distributed System Model Checking with Specification-Level State Exploration

R Tang, X Sun, Y Huang, Y Wei, L Ouyang… - Proceedings of the …, 2024 - dl.acm.org
Implementation-level distributed system model checkers (DMCKs) have proven valuable in
verifying the correctness of real distributed systems. However, they primarily focus on state …

Chronos: Finding timeout bugs in practical distributed systems by deep-priority fuzzing with transient delay

Y Chen, F Ma, Y Zhou, M Gu, Q Liao… - 2024 IEEE Symposium …, 2024 - ieeexplore.ieee.org
Delays are inevitable in complex distributed environments. Timeout mechanisms are
commonly used to handle unexpected failures in distributed systems. However, incorrect …

Reward Augmentation in Reinforcement Learning for Testing Distributed Systems

A Borgarelli, C Enea, R Majumdar… - Proceedings of the ACM …, 2024 - dl.acm.org
Bugs in popular distributed protocol implementations have been the source of many
downtimes in popular internet services. We describe a randomized testing approach for …

An Empirical Study on Kubernetes Operator Bugs

Q Xu, Y Gao, J Wei - Proceedings of the 33rd ACM SIGSOFT …, 2024 - dl.acm.org
Kubernetes is the leading cluster management platform, and within Kubernetes, an operator
is an application-specific program that leverages the Kubernetes API to automate operation …

FaultFuzz: A coverage guided fault injection tool for distributed systems

W Feng, Q Pei, Y Gao, D Wang, W Dou, J Wei… - Proceedings of the …, 2024 - dl.acm.org
Distributed systems are expected to correctly recover from various faults, e.g., node
crash/reboot and network disconnection/reconnection. However, faults that occur under …

Model-guided Fuzzing of Distributed Systems

EB Gulcan, BK Ozkan, R Majumdar… - arXiv preprint arXiv …, 2024 - arxiv.org
We present a coverage-guided testing algorithm for distributed systems implementations.
Our main innovation is the use of an abstract formal model of the system that is used to …

Blackbox Fuzzing of Distributed Systems with Multi-Dimensional Inputs and Symmetry-Based Feedback Pruning

Y Zou, JJ Bai, ZM Jiang, M Zhao, D Zhou - jzuming.github.io
This paper presents DistFuzz, which, to our knowledge, is the first feedback-guided blackbox
fuzzing framework for distributed systems. The novelty of DistFuzz comes from two …