AI for IT operations (AIOps) on cloud platforms: Reviews, opportunities and challenges

Q Cheng, D Sahoo, A Saha, W Yang, C Liu… - arXiv preprint arXiv …, 2023 - arxiv.org
Artificial Intelligence for IT operations (AIOps) aims to combine the power of AI with the big
data generated by IT Operations processes, particularly in cloud infrastructures, to provide …

Recommending root-cause and mitigation steps for cloud incidents using large language models

T Ahmed, S Ghosh, C Bansal… - 2023 IEEE/ACM 45th …, 2023 - ieeexplore.ieee.org
Incident management for cloud services is a complex process involving several steps and
has a huge impact on both service health and developer productivity. On-call engineers …

How to fight production incidents? An empirical study on a large-scale cloud service

S Ghosh, M Shetty, C Bansal, S Nath - … of the 13th Symposium on Cloud …, 2022 - dl.acm.org
Production incidents in today's large-scale cloud services can be extremely expensive in
terms of customer impacts and engineering resources required to mitigate them. Despite …

Detection is better than cure: A cloud incidents perspective

V Ganatra, A Parayil, S Ghosh, Y Kang, M Ma… - Proceedings of the 31st …, 2023 - dl.acm.org
Cloud providers use automated watchdogs or monitors to continuously observe service
availability and to proactively report incidents when system performance degrades. Improper …

KGroot: A knowledge graph-enhanced method for root cause analysis

T Wang, G Qi, T Wu - Expert Systems with Applications, 2024 - Elsevier
Fault localization in online microservices is a challenging task due to the vast amount of
monitoring data, diversity of types and events, and complex interdependencies among …

FAIL: Analyzing Software Failures from the News Using LLMs

D Anandayuvaraj, M Campbell, A Tewari… - Proceedings of the 39th …, 2024 - dl.acm.org
Software failures inform engineering work, standards, regulations. For example, the Log4J
vulnerability brought government and industry attention to evaluating and securing software …

AutoTSG: learning and synthesis for incident troubleshooting

M Shetty, C Bansal, SP Upadhyayula… - Proceedings of the 30th …, 2022 - dl.acm.org
Incident management is a key aspect of operating large-scale cloud services. To aid with
faster and efficient resolution of incidents, engineering teams document frequent …

Studying the characteristics of AIOps projects on GitHub

R Aghili, H Li, F Khomh - Empirical Software Engineering, 2023 - Springer
Artificial Intelligence for IT Operations (AIOps) leverages AI approaches to handle
the massive amount of data generated during the operations of software systems. Prior …

ESRO: Experience Assisted Service Reliability against Outages

S Chakraborty, S Agarwal, S Garg… - 2023 38th IEEE/ACM …, 2023 - ieeexplore.ieee.org
Modern cloud services are prone to failures due to their complex architecture, making
diagnosis a critical process. Site Reliability Engineers (SREs) spend hours leveraging …

Leveraging Large Language Models for the Auto-remediation of Microservice Applications: An Experimental Study

K Sarda, Z Namrud, M Litoiu, L Shwartz… - … Proceedings of the 32nd …, 2024 - dl.acm.org
Runtime auto-remediation is crucial for ensuring the reliability and efficiency of distributed
systems, especially within complex microservice-based applications. However, the …