Eadro: An end-to-end troubleshooting framework for microservices on multi-source data
The complexity and dynamism of microservices pose significant challenges to system
reliability, and thereby, automated troubleshooting is crucial. Effective root cause localization …
Causal inference-based root cause analysis for online service systems with intervention recognition
Fault diagnosis is critical in many domains, as faults may lead to safety threats or economic
losses. In the field of online service systems, operators rely on enormous monitoring data to …
Robust failure diagnosis of microservice system through multimodal data
Automatic failure diagnosis is crucial for large microservice systems. Currently, most failure
diagnosis methods rely solely on single-modal data (i.e., using either metrics, logs, or traces) …
Interpretable Failure Localization for Microservice Systems Based on Graph Autoencoder
Accurate and efficient localization of root cause instances in large-scale microservice
systems is of paramount importance. Unfortunately, prevailing methods face several …
Incident-aware duplicate ticket aggregation for cloud systems
In cloud systems, incidents are potential threats to customer satisfaction and business
revenue. When customers are affected by incidents, they often request customer support …
Maat: Performance metric anomaly anticipation for cloud services with conditional diffusion
Ensuring the reliability and user satisfaction of cloud services necessitates prompt anomaly
detection followed by diagnosis. Existing techniques for anomaly detection focus solely on …
Prism: Revealing hidden functional clusters from massive instances in cloud systems
Ensuring the reliability of cloud systems is critical for both cloud vendors and customers.
Cloud systems often rely on virtualization techniques to create instances of hardware …
The Vision of Autonomic Computing: Can LLMs Make It a Reality?
The Vision of Autonomic Computing (ACV), proposed over two decades ago, envisions
computing systems that self-manage akin to biological organisms, adapting seamlessly to …
MicroRes: Versatile Resilience Profiling in Microservices via Degradation Dissemination Indexing
Microservice resilience, the ability of microservices to recover from failures and continue
providing reliable and responsive services, is crucial for cloud vendors. However, the current …
Demystifying and Extracting Fault-indicating Information from Logs for Failure Diagnosis
Logs are imperative in the maintenance of online service systems, which often encompass
important information for effective failure mitigation. While existing anomaly detection …