Automatic root cause analysis via large language models for cloud incidents

Y Chen, H Xie, M Ma, Y Kang, X Gao, L Shi… - Proceedings of the …, 2024 - dl.acm.org
Ensuring the reliability and availability of cloud services necessitates efficient root cause
analysis (RCA) for cloud incidents. Traditional RCA methods, which rely on manual …

Anvil: Verifying liveness of cluster management controllers

X Sun, W Ma, JT Gu, Z Ma, T Chajed, J Howell… - … USENIX Symposium on …, 2024 - usenix.org
Modern clouds depend crucially on an extensible ecosystem of thousands of controllers,
each managing critical systems (e.g., a ZooKeeper cluster). A controller continuously …

Acto: Automatic end-to-end testing for operation correctness of cloud system management

JT Gu, X Sun, W Zhang, Y Jiang, C Wang… - Proceedings of the 29th …, 2023 - dl.acm.org
Cloud systems are increasingly being managed by operation programs termed operators,
which automate tedious, human-based operations. Operators of modern management …

Randomized testing of byzantine fault tolerant algorithms

LN Winter, F Buse, D De Graaf… - Proceedings of the …, 2023 - dl.acm.org
Byzantine fault-tolerant algorithms promise agreement on a correct value, even if a subset of
processes can deviate from the algorithm arbitrarily. While these algorithms provide strong …

If At First You Don't Succeed, Try, Try, Again...? Insights and LLM-informed Tooling for Detecting Retry Bugs in Software Systems

BA Stoica, U Sethi, Y Su, C Zhou, S Lu, J Mace… - Proceedings of the …, 2024 - dl.acm.org
Retry (the re-execution of a task on failure) is a common mechanism to enable resilient
software systems. Yet, despite its commonality and long history, retry remains difficult to …

When your infrastructure is a buggy program: Understanding faults in infrastructure as code ecosystems

GP Drosos, T Sotiropoulos, G Alexopoulos… - Proceedings of the …, 2024 - dl.acm.org
Modern applications have become increasingly complex and their manual installation and
configuration is no longer practical. Instead, IT organizations heavily rely on Infrastructure as …

Push-Button reliability testing for Cloud-Backed applications with Rainmaker

Y Chen, X Sun, S Nath, Z Yang, T Xu - 20th USENIX Symposium on …, 2023 - usenix.org
Modern applications have been emerging towards a cloud-based programming model
where applications depend on cloud services for various functionalities. Such “cloud native” …

Aegis: Attribution of control plane change impact across layers and components for cloud systems

X Yan, K Hsieh, Y Liyanage, M Ma… - 2023 IEEE/ACM 45th …, 2023 - ieeexplore.ieee.org
Modern cloud control plane infrastructure like Microsoft Azure has evolved into a complex
one to serve customer needs for diverse types of services and adequate cloud-based …

SandTable: Scalable Distributed System Model Checking with Specification-Level State Exploration

R Tang, X Sun, Y Huang, Y Wei, L Ouyang… - Proceedings of the …, 2024 - dl.acm.org
Implementation-level distributed system model checkers (DMCKs) have proven valuable in
verifying the correctness of real distributed systems. However, they primarily focus on state …

Mutiny! How does Kubernetes fail, and what can we do about it?

M Barletta, M Cinque, C Di Martino… - 2024 54th Annual …, 2024 - ieeexplore.ieee.org
In this paper, we i) analyze and classify real-world failures of Kubernetes (the most popular
container orchestration system), ii) develop a framework to perform a fault/error injection …