AI for IT operations (AIOps) on cloud platforms: Reviews, opportunities and challenges

Q Cheng, D Sahoo, A Saha, W Yang, C Liu… - arXiv preprint arXiv …, 2023 - arxiv.org
Artificial Intelligence for IT operations (AIOps) aims to combine the power of AI with the big
data generated by IT Operations processes, particularly in cloud infrastructures, to provide …

Failure diagnosis in microservice systems: A comprehensive survey and analysis

S Zhang, S Xia, W Fan, B Shi, X Xiong… - ACM Transactions on …, 2024 - dl.acm.org
Widely adopted for their scalability and flexibility, modern microservice systems present
unique failure diagnosis challenges due to their independent deployment and dynamic …

Root cause analysis of failures in microservices through causal discovery

A Ikram, S Chakraborty, S Mitra… - Advances in …, 2022 - proceedings.neurips.cc
Most cloud applications use a large number of smaller sub-components (called
microservices) that interact with each other in the form of a complex graph to provide the …

Automated root causing of cloud incidents using in-context learning with GPT-4

X Zhang, S Ghosh, C Bansal, R Wang, M Ma… - … Proceedings of the …, 2024 - dl.acm.org
Root Cause Analysis (RCA) plays a pivotal role in the incident diagnosis process for cloud
services, requiring on-call engineers to identify the primary issues and implement corrective …

Incremental causal graph learning for online root cause analysis

D Wang, Z Chen, Y Fu, Y Liu, H Chen - Proceedings of the 29th ACM …, 2023 - dl.acm.org
The task of root cause analysis (RCA) is to identify the root causes of system faults/failures
by analyzing system monitoring data. Efficient RCA can greatly accelerate system failure …

BARO: Robust root cause analysis for microservices via multivariate Bayesian online change point detection

L Pham, H Ha, H Zhang - Proceedings of the ACM on Software …, 2024 - dl.acm.org
Detecting failures and identifying their root causes promptly and accurately is crucial for
ensuring the availability of microservice systems. A typical failure troubleshooting pipeline …

Microservice root cause analysis with limited observability through intervention recognition in the latent space

Z Xie, S Zhang, Y Geng, Y Zhang, M Ma, X Nie… - Proceedings of the 30th …, 2024 - dl.acm.org
Many failure root cause analysis (RCA) algorithms for microservices have been proposed
with the widespread adoption of microservices systems. Existing algorithms generally focus …

KGroot: A knowledge graph-enhanced method for root cause analysis

T Wang, G Qi, T Wu - Expert Systems with Applications, 2024 - Elsevier
Fault localization in online microservices is a challenging task due to the vast amount of
monitoring data, diversity of types and events, and complex interdependencies among …

Case studies of causal discovery from IT monitoring time series

A Aït-Bachir, CK Assaad, C de Bignicourt… - arXiv preprint arXiv …, 2023 - arxiv.org
Information technology (IT) systems are vital for modern businesses, handling data storage,
communication, and process automation. Monitoring these systems is crucial for their proper …

MULAN: Multi-modal Causal Structure Learning and Root Cause Analysis for Microservice Systems

L Zheng, Z Chen, J He, H Chen - Proceedings of the ACM on Web …, 2024 - dl.acm.org
Effective root cause analysis (RCA) is vital for swiftly restoring services, minimizing losses,
and ensuring the smooth operation and management of complex systems. Previous data …