Nezha: Interpretable fine-grained root causes analysis for microservices on multi-modal observability data

G Yu, P Chen, Y Li, H Chen, X Li, Z Zheng - Proceedings of the 31st …, 2023 - dl.acm.org
Root cause analysis (RCA) in large-scale microservice systems is a critical and challenging
task. To understand and localize root causes of unexpected faults, modern observability …

Failure diagnosis in microservice systems: A comprehensive survey and analysis

S Zhang, S Xia, W Fan, B Shi, X Xiong… - ACM Transactions on …, 2024 - dl.acm.org
Widely adopted for their scalability and flexibility, modern microservice systems present
unique failure diagnosis challenges due to their independent deployment and dynamic …

MULAN: multi-modal causal structure learning and root cause analysis for microservice systems

L Zheng, Z Chen, J He, H Chen - … of the ACM Web Conference 2024, 2024 - dl.acm.org
Effective root cause analysis (RCA) is vital for swiftly restoring services, minimizing losses,
and ensuring the smooth operation and management of complex systems. Previous data …

Robust failure diagnosis of microservice system through multimodal data

S Zhang, P Jin, Z Lin, Y Sun, B Zhang… - IEEE Transactions …, 2023 - ieeexplore.ieee.org
Automatic failure diagnosis is crucial for large microservice systems. Currently, most failure
diagnosis methods rely solely on single-modal data (i.e., using either metrics, logs, or traces) …

Interpretable failure localization for microservice systems based on graph autoencoder

Y Sun, Z Lin, B Shi, S Zhang, S Ma, P Jin… - ACM Transactions on …, 2025 - dl.acm.org
Accurate and efficient localization of root cause instances in large-scale microservice
systems is of paramount importance. Unfortunately, prevailing methods face several …

HeMiRCA: Fine-grained root cause analysis for microservices with heterogeneous data sources

Z Zhu, C Lee, X Tang, P He - ACM Transactions on Software …, 2024 - dl.acm.org
Microservices architecture improves software scalability, resilience, and agility but also
poses significant challenges to system reliability due to their complexity and dynamic nature …

ART: A Unified Unsupervised Framework for Incident Management in Microservice Systems

Y Sun, B Shi, M Mao, M Ma, S Xia, S Zhang… - Proceedings of the 39th …, 2024 - dl.acm.org
Automated incident management is critical for large-scale microservice systems, including
tasks such as anomaly detection (AD), failure triage (FT), and root cause localization (RCL) …

An empirical study on change-induced incidents of online service systems

Y Wu, B Chai, Y Li, B Liu, J Li, Y Yang… - 2023 IEEE/ACM 45th …, 2023 - ieeexplore.ieee.org
Although dedicated efforts have been devoted to ensuring the service quality of online
service systems, these systems are still suffering from incidents due to various causes, which …

ServiceAnomaly: An anomaly detection approach in microservices using distributed traces and profiling metrics

M Panahandeh, A Hamou-Lhadj, M Hamdaqa… - Journal of Systems and …, 2024 - Elsevier
Anomaly detection is an essential activity for identifying abnormal behaviours in
microservice-based systems. A common approach is to model the system behaviour during …

Chain-of-event: Interpretable root cause analysis for microservices through automatically learning weighted event causal graph

Z Yao, C Pei, W Chen, H Wang, L Su, H Jiang… - … Proceedings of the …, 2024 - dl.acm.org
This paper presents Chain-of-Event (CoE), an interpretable model for root cause analysis in
microservice systems that analyzes causal relationships of events transformed from multi …