Failure diagnosis in microservice systems: A comprehensive survey and analysis

S Zhang, S Xia, W Fan, B Shi, X Xiong… - ACM Transactions on …, 2024 - dl.acm.org
Widely adopted for their scalability and flexibility, modern microservice systems present
unique failure diagnosis challenges due to their independent deployment and dynamic …

Causal inference-based root cause analysis for online service systems with intervention recognition

M Li, Z Li, K Yin, X Nie, W Zhang, K Sui… - Proceedings of the 28th …, 2022 - dl.acm.org
Fault diagnosis is critical in many domains, as faults may lead to safety threats or economic
losses. In the field of online service systems, operators rely on enormous monitoring data to …

Root cause analysis for microservice systems via hierarchical reinforcement learning from human feedback

L Wang, C Zhang, R Ding, Y Xu, Q Chen… - Proceedings of the 29th …, 2023 - dl.acm.org
In microservice systems, the identification of root causes of anomalies is imperative for
service reliability and business impact. This process is typically divided into two phases: (i) …

TraceDiag: Adaptive, interpretable, and efficient root cause analysis on large-scale microservice systems

R Ding, C Zhang, L Wang, Y Xu, M Ma, X Wu… - Proceedings of the 31st …, 2023 - dl.acm.org
Root Cause Analysis (RCA) is becoming increasingly crucial for ensuring the reliability of
microservice systems. However, performing RCA on modern microservice systems can be …

Constructing large-scale real-world benchmark datasets for AIOps

Z Li, N Zhao, S Zhang, Y Sun, P Chen, X Wen… - arXiv preprint arXiv …, 2022 - arxiv.org
Recently, AIOps (Artificial Intelligence for IT Operations) has been well studied in academia
and industry to enable automated and effective software service management. Plenty of …

Trustworthy AI-based Performance Diagnosis Systems for Cloud Applications: A Review

R Xin, J Wang, P Chen, Z Zhao - ACM Computing Surveys, 2025 - dl.acm.org
Performance diagnosis systems are defined as detecting abnormal performance
phenomena and play a crucial role in cloud applications. An effective performance …

An intelligent framework for timely, accurate, and comprehensive cloud incident detection

Y Li, X Zhang, S He, Z Chen, Y Kang, J Liu… - ACM SIGOPS …, 2022 - dl.acm.org
Cloud incidents (service interruptions or performance degradation) dramatically degrade the
reliability of large-scale cloud systems, causing customer dissatisfaction and revenue loss …

Conan: Diagnosing batch failures for cloud systems

L Li, X Zhang, S He, Y Kang, H Zhang… - 2023 IEEE/ACM 45th …, 2023 - ieeexplore.ieee.org
Failure diagnosis is critical to the maintenance of large-scale cloud systems, which has
attracted tremendous attention from academia and industry over the last decade. In this …

FaultProfIT: Hierarchical fault profiling of incident tickets in large-scale cloud systems

J Huang, J Liu, Z Chen, Z Jiang, Y Li, J Gu… - Proceedings of the 46th …, 2024 - dl.acm.org
Postmortem analysis is essential in the management of incidents within cloud systems,
which provides valuable insights to improve system's reliability and robustness. At CloudA …

Graph based incident extraction and diagnosis in large-scale online systems

Z He, P Chen, Y Luo, Q Yan, H Chen, G Yu… - Proceedings of the 37th …, 2022 - dl.acm.org
With the ever increasing scale and complexity of online systems, incidents are gradually
becoming commonplace. Without appropriate handling, they can seriously harm the system …