Eadro: An end-to-end troubleshooting framework for microservices on multi-source data
The complexity and dynamism of microservices pose significant challenges to system
reliability, and thereby, automated troubleshooting is crucial. Effective root cause localization …
Causal inference-based root cause analysis for online service systems with intervention recognition
Fault diagnosis is critical in many domains, as faults may lead to safety threats or economic
losses. In the field of online service systems, operators rely on enormous monitoring data to …
Interpretable failure localization for microservice systems based on graph autoencoder
Accurate and efficient localization of root cause instances in large-scale microservice
systems is of paramount importance. Unfortunately, prevailing methods face several …
Robust failure diagnosis of microservice system through multimodal data
Automatic failure diagnosis is crucial for large microservice systems. Currently, most failure
diagnosis methods rely solely on single-modal data (i.e., using either metrics, logs, or traces) …
Incident-aware duplicate ticket aggregation for cloud systems
In cloud systems, incidents are potential threats to customer satisfaction and business
revenue. When customers are affected by incidents, they often request customer support …
Microfi: Non-intrusive and prioritized request-level fault injection for microservice applications
Microservice is a widely-adopted architecture for constructing cloud-native applications. To
test application resiliency, chaos engineering is widely used to inject faults proactively in …
Maat: Performance metric anomaly anticipation for cloud services with conditional diffusion
Ensuring the reliability and user satisfaction of cloud services necessitates prompt anomaly
detection followed by diagnosis. Existing techniques for anomaly detection focus solely on …
Prism: Revealing hidden functional clusters from massive instances in cloud systems
Ensuring the reliability of cloud systems is critical for both cloud vendors and customers.
Cloud systems often rely on virtualization techniques to create instances of hardware …
The Vision of Autonomic Computing: Can LLMs Make It a Reality?
The Vision of Autonomic Computing (ACV), proposed over two decades ago, envisions
computing systems that self-manage akin to biological organisms, adapting seamlessly to …
MicroRes: Versatile Resilience Profiling in Microservices via Degradation Dissemination Indexing
Microservice resilience, the ability of microservices to recover from failures and continue
providing reliable and responsive services, is crucial for cloud vendors. However, the current …