Automatic root cause analysis via large language models for cloud incidents
Ensuring the reliability and availability of cloud services necessitates efficient root cause
analysis (RCA) for cloud incidents. Traditional RCA methods, which rely on manual …
Anvil: Verifying liveness of cluster management controllers
Modern clouds depend crucially on an extensible ecosystem of thousands of controllers,
each managing critical systems (e.g., a ZooKeeper cluster). A controller continuously …
Acto: Automatic end-to-end testing for operation correctness of cloud system management
Cloud systems are increasingly being managed by operation programs termed operators,
which automate tedious, human-based operations. Operators of modern management …
Randomized testing of byzantine fault tolerant algorithms
Byzantine fault-tolerant algorithms promise agreement on a correct value, even if a subset of
processes can deviate from the algorithm arbitrarily. While these algorithms provide strong …
If At First You Don't Succeed, Try, Try, Again...? Insights and LLM-informed Tooling for Detecting Retry Bugs in Software Systems
Retry, the re-execution of a task on failure, is a common mechanism to enable resilient
software systems. Yet, despite its commonality and long history, retry remains difficult to …
When your infrastructure is a buggy program: Understanding faults in infrastructure as code ecosystems
Modern applications have become increasingly complex and their manual installation and
configuration is no longer practical. Instead, IT organizations heavily rely on Infrastructure as …
Push-Button reliability testing for Cloud-Backed applications with Rainmaker
Modern applications have been emerging towards a cloud-based programming model
where applications depend on cloud services for various functionalities. Such “cloud native” …
Aegis: Attribution of control plane change impact across layers and components for cloud systems
Modern cloud control plane infrastructure like Microsoft Azure has evolved into a complex
one to serve customer needs for diverse types of services and adequate cloud-based …
SandTable: Scalable Distributed System Model Checking with Specification-Level State Exploration
Implementation-level distributed system model checkers (DMCKs) have proven valuable in
verifying the correctness of real distributed systems. However, they primarily focus on state …
Mutiny! How does Kubernetes fail, and what can we do about it?
In this paper, we i) analyze and classify real-world failures of Kubernetes (the most popular
container orchestration system), ii) develop a framework to perform a fault/error injection …