Failure diagnosis in microservice systems: A comprehensive survey and analysis

S Zhang, S Xia, W Fan, B Shi, X Xiong… - ACM Transactions on …, 2024 - dl.acm.org
Widely adopted for their scalability and flexibility, modern microservice systems present
unique failure diagnosis challenges due to their independent deployment and dynamic …

MULAN: multi-modal causal structure learning and root cause analysis for microservice systems

L Zheng, Z Chen, J He, H Chen - … of the ACM Web Conference 2024, 2024 - dl.acm.org
Effective root cause analysis (RCA) is vital for swiftly restoring services, minimizing losses,
and ensuring the smooth operation and management of complex systems. Previous data …

A Survey on Failure Analysis and Fault Injection in AI Systems

G Yu, G Tan, H Huang, Z Zhang, P Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid advancement of Artificial Intelligence (AI) has led to its integration into various
areas, especially with Large Language Models (LLMs) significantly enhancing capabilities …

Interpretable failure localization for microservice systems based on graph autoencoder

Y Sun, Z Lin, B Shi, S Zhang, S Ma, P Jin… - ACM Transactions on …, 2025 - dl.acm.org
Accurate and efficient localization of root cause instances in large-scale microservice
systems is of paramount importance. Unfortunately, prevailing methods face several …

Microservice root cause analysis with limited observability through intervention recognition in the latent space

Z Xie, S Zhang, Y Geng, Y Zhang, M Ma, X Nie… - Proceedings of the 30th …, 2024 - dl.acm.org
Many failure root cause analysis (RCA) algorithms for microservices have been proposed
with the widespread adoption of microservices systems. Existing algorithms generally focus …

ART: A Unified Unsupervised Framework for Incident Management in Microservice Systems

Y Sun, B Shi, M Mao, M Ma, S Xia, S Zhang… - Proceedings of the 39th …, 2024 - dl.acm.org
Automated incident management is critical for large-scale microservice systems, including
tasks such as anomaly detection (AD), failure triage (FT), and root cause localization (RCL) …

TraceMesh: Scalable and streaming sampling for distributed traces

Z Chen, Z Jiang, Y Su, MR Lyu… - 2024 IEEE 17th …, 2024 - ieeexplore.ieee.org
Distributed tracing serves as a fundamental element in the monitoring of cloud-based and
datacenter systems. It provides visibility into the full life cycle of a request or operation across …

UAC-AD: Unsupervised adversarial contrastive learning for anomaly detection on multi-modal data in microservice systems

H Liu, X Huang, M Jia, T Jia, J Han… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
To ensure the stability and reliability of microservice systems, timely and accurate anomaly
detection is of utmost importance. Recently, considering the lack of labels in real-world …

ChangeRCA: Finding Root Causes from Software Changes in Large Online Systems

G Yu, P Chen, Z He, Q Yan, Y Luo, F Li… - Proceedings of the ACM …, 2024 - dl.acm.org
In large-scale online service systems, the occurrence of software changes is inevitable and
frequent. Despite rigorous pre-deployment testing practices, the presence of defective …

TraStrainer: Adaptive sampling for distributed traces with system runtime state

H Huang, X Zhang, P Chen, Z He, Z Chen… - Proceedings of the …, 2024 - dl.acm.org
Distributed tracing has been widely adopted in many microservice systems and plays an
important role in monitoring and analyzing the system. However, trace data often come in …