Anomaly detection and failure root cause analysis in (micro) service-based cloud applications: A survey

J Soldani, A Brogi - ACM Computing Surveys (CSUR), 2022 - dl.acm.org
The proliferation of services and service interactions within microservices and cloud-native
applications makes it harder to detect failures and to identify their possible root causes …

AI for IT operations (AIOps) on cloud platforms: Reviews, opportunities and challenges

Q Cheng, D Sahoo, A Saha, W Yang, C Liu… - arXiv preprint arXiv …, 2023 - arxiv.org
Artificial Intelligence for IT operations (AIOps) aims to combine the power of AI with the big
data generated by IT Operations processes, particularly in cloud infrastructures, to provide …

Automatic root cause analysis via large language models for cloud incidents

Y Chen, H Xie, M Ma, Y Kang, X Gao, L Shi… - Proceedings of the …, 2024 - dl.acm.org
Ensuring the reliability and availability of cloud services necessitates efficient root cause
analysis (RCA) for cloud incidents. Traditional RCA methods, which rely on manual …

DeepTraLog: Trace-log combined microservice anomaly detection through graph-based deep learning

C Zhang, X Peng, C Sha, K Zhang, Z Fu, X Wu… - Proceedings of the 44th …, 2022 - dl.acm.org
A microservice system in industry is usually a large-scale distributed system consisting of
dozens to thousands of services running in different machines. An anomaly of the system …

Eadro: An end-to-end troubleshooting framework for microservices on multi-source data

C Lee, T Yang, Z Chen, Y Su… - 2023 IEEE/ACM 45th …, 2023 - ieeexplore.ieee.org
The complexity and dynamism of microservices pose significant challenges to system
reliability, and thereby, automated troubleshooting is crucial. Effective root cause localization …

Causal inference-based root cause analysis for online service systems with intervention recognition

M Li, Z Li, K Yin, X Nie, W Zhang, K Sui… - Proceedings of the 28th …, 2022 - dl.acm.org
Fault diagnosis is critical in many domains, as faults may lead to safety threats or economic
losses. In the field of online service systems, operators rely on enormous monitoring data to …

Actionable and interpretable fault localization for recurring failures in online service systems

Z Li, N Zhao, M Li, X Lu, L Wang, D Chang… - Proceedings of the 30th …, 2022 - dl.acm.org
Fault localization is challenging in an online service system due to its monitoring data's large
volume and variety and complex dependencies across/within its components (e.g., services …

Incremental causal graph learning for online root cause analysis

D Wang, Z Chen, Y Fu, Y Liu, H Chen - Proceedings of the 29th ACM …, 2023 - dl.acm.org
The task of root cause analysis (RCA) is to identify the root causes of system faults/failures
by analyzing system monitoring data. Efficient RCA can greatly accelerate system failure …

Interdependent causal networks for root cause localization

D Wang, Z Chen, J Ni, L Tong, Z Wang, Y Fu… - Proceedings of the 29th …, 2023 - dl.acm.org
The goal of root cause analysis is to identify the underlying causes of system problems by
discovering and analyzing the causal structure from system monitoring data. It is …

Empowering practical root cause analysis by large language models for cloud incidents

Y Chen, H Xie, M Ma, Y Kang, X Gao… - arXiv preprint arXiv …, 2023 - yinfangchen.github.io
Ensuring the reliability and availability of cloud services necessitates efficient root cause
analysis (RCA) for cloud incidents. Traditional RCA methods, which rely on manual …