AI for IT operations (AIOps) on cloud platforms: Reviews, opportunities and challenges

Q Cheng, D Sahoo, A Saha, W Yang, C Liu… - arXiv preprint arXiv …, 2023 - arxiv.org
Artificial Intelligence for IT operations (AIOps) aims to combine the power of AI with the big
data generated by IT Operations processes, particularly in cloud infrastructures, to provide …

Eadro: An end-to-end troubleshooting framework for microservices on multi-source data

C Lee, T Yang, Z Chen, Y Su… - 2023 IEEE/ACM 45th …, 2023 - ieeexplore.ieee.org
The complexity and dynamism of microservices pose significant challenges to system
reliability, and thereby, automated troubleshooting is crucial. Effective root cause localization …

Automated root causing of cloud incidents using in-context learning with GPT-4

X Zhang, S Ghosh, C Bansal, R Wang, M Ma… - … Proceedings of the …, 2024 - dl.acm.org
Root Cause Analysis (RCA) plays a pivotal role in the incident diagnosis process for cloud
services, requiring on-call engineers to identify the primary issues and implement corrective …

Empowering practical root cause analysis by large language models for cloud incidents

Y Chen, H Xie, M Ma, Y Kang, X Gao… - arXiv preprint arXiv …, 2023 - yinfangchen.github.io
Ensuring the reliability and availability of cloud services necessitates efficient root cause
analysis (RCA) for cloud incidents. Traditional RCA methods, which rely on manual …

Automatic root cause analysis via large language models for cloud incidents

Y Chen, H Xie, M Ma, Y Kang, X Gao, L Shi… - Proceedings of the …, 2024 - dl.acm.org
Ensuring the reliability and availability of cloud services necessitates efficient root cause
analysis (RCA) for cloud incidents. Traditional RCA methods, which rely on manual …

Trustworthy AI-based Performance Diagnosis Systems for Cloud Applications: A Review

R Xin, J Wang, P Chen, Z Zhao - ACM Computing Surveys, 2025 - dl.acm.org
Performance diagnosis systems are defined as detecting abnormal performance
phenomena and play a crucial role in cloud applications. An effective performance …

BARO: Robust root cause analysis for microservices via multivariate Bayesian online change point detection

L Pham, H Ha, H Zhang - Proceedings of the ACM on Software …, 2024 - dl.acm.org
Detecting failures and identifying their root causes promptly and accurately is crucial for
ensuring the availability of microservice systems. A typical failure troubleshooting pipeline …

Nezha: Interpretable fine-grained root causes analysis for microservices on multi-modal observability data

G Yu, P Chen, Y Li, H Chen, X Li, Z Zheng - Proceedings of the 31st …, 2023 - dl.acm.org
Root cause analysis (RCA) in large-scale microservice systems is a critical and challenging
task. To understand and localize root causes of unexpected faults, modern observability …

Robust multimodal failure detection for microservice systems

C Zhao, M Ma, Z Zhong, S Zhang, Z Tan… - Proceedings of the 29th …, 2023 - dl.acm.org
Proactive failure detection of instances is vitally essential to microservice systems because
an instance failure can propagate to the whole system and degrade the system's …

Interpretable failure localization for microservice systems based on graph autoencoder

Y Sun, Z Lin, B Shi, S Zhang, S Ma, P Jin… - ACM Transactions on …, 2024 - dl.acm.org
Accurate and efficient localization of root cause instances in large-scale microservice
systems is of paramount importance. Unfortunately, prevailing methods face several …