Halfmoon: Log-optimal fault-tolerant stateful serverless computing

S Qi, X Liu, X Jin - Proceedings of the 29th Symposium on Operating …, 2023 - dl.acm.org
Serverless computing separates function execution from state management. Simple
retry-based fault tolerance might corrupt the shared state with duplicate updates. Existing …

Greybox fuzzing of distributed systems

R Meng, G Pîrlea, A Roychoudhury… - Proceedings of the 2023 …, 2023 - dl.acm.org
Grey-box fuzzing is the lightweight approach of choice for finding bugs in sequential
programs. It provides a balance between efficiency and effectiveness by conducting a …

Model checking guided testing for distributed systems

D Wang, W Dou, Y Gao, C Wu, J Wei… - Proceedings of the …, 2023 - dl.acm.org
Distributed systems have become the backbone of cloud computing. Incorrect system
designs and implementations can greatly impair the reliability of distributed systems …

Automatic reliability testing for cluster management controllers

X Sun, W Luo, JT Gu, A Ganesan… - … USENIX Symposium on …, 2022 - usenix.org
Modern cluster managers like Borg, Omega and Kubernetes rely on the state-reconciliation
principle to be highly resilient and extensible. In these systems, all cluster-management …

Gobench: A benchmark suite of real-world go concurrency bugs

T Yuan, G Li, J Lu, C Liu, L Li… - 2021 IEEE/ACM …, 2021 - ieeexplore.ieee.org
Go, a fast growing programming language, is often considered as “the programming
language of the cloud”. The language provides a rich set of synchronization primitives …

Service-level fault injection testing

CS Meiklejohn, A Estrada, Y Song, H Miller… - Proceedings of the …, 2021 - dl.acm.org
Companies today increasingly rely on microservice architectures to deliver service for their
large-scale mobile or web applications. However, not all developers working on these …

Compiling distributed system models with PGo

F Hackett, S Hosseini, R Costa, M Do… - Proceedings of the 28th …, 2023 - dl.acm.org
Distributed systems are difficult to design and implement correctly. In response, both
research and industry are exploring applications of formal methods to distributed systems. A …

CoFI: Consistency-guided fault injection for cloud systems

H Chen, W Dou, D Wang, F Qin - Proceedings of the 35th IEEE/ACM …, 2020 - dl.acm.org
Network partitions are inevitable in large-scale cloud systems. Despite developers' efforts in
handling network partitions throughout designing, implementing and testing cloud systems …

SandTable: Scalable Distributed System Model Checking with Specification-Level State Exploration

R Tang, X Sun, Y Huang, Y Wei, L Ouyang… - Proceedings of the …, 2024 - dl.acm.org
Implementation-level distributed system model checkers (DMCKs) have proven valuable in
verifying the correctness of real distributed systems. However, they primarily focus on state …

Chronos: Finding timeout bugs in practical distributed systems by deep-priority fuzzing with transient delay

Y Chen, F Ma, Y Zhou, M Gu, Q Liao… - 2024 IEEE Symposium …, 2024 - ieeexplore.ieee.org
Delays are inevitable in complex distributed environments. Timeout mechanisms are
commonly used to handle unexpected failures in distributed systems. However, incorrect …