A joint study of the challenges, opportunities, and roadmap of MLOps and AIOps: A systematic survey

J Diaz-De-Arcaya, AI Torre-Bastida, G Zárate… - ACM Computing …, 2023 - dl.acm.org
Data science projects represent a greater challenge than software engineering for
organizations pursuing their adoption. The diverse stakeholders involved emphasize the …

RCAgent: Cloud root cause analysis by autonomous agents with tool-augmented large language models

Z Wang, Z Liu, Y Zhang, A Zhong, J Wang… - Proceedings of the 33rd …, 2024 - dl.acm.org
Large language model (LLM) applications in cloud root cause analysis (RCA) have been
actively explored recently. However, current methods are still reliant on manual workflow …

A survey of AIOps for failure management in the era of large language models

L Zhang, T Jia, M Jia, Y Wu, A Liu, Y Yang, Z Wu… - arXiv preprint arXiv …, 2024 - arxiv.org
As software systems grow increasingly intricate, Artificial Intelligence for IT Operations
(AIOps) methods have been widely used in software system failure management to ensure …

KGroot: A knowledge graph-enhanced method for root cause analysis

T Wang, G Qi, T Wu - Expert Systems with Applications, 2024 - Elsevier
Fault localization in online microservices is a challenging task due to the vast amount of
monitoring data, diversity of types and events, and complex interdependencies among …

Navigating the DevOps landscape

X Zhang, P Zhao, J Jaskolka - Journal of Systems and Software, 2025 - Elsevier
Context: DevOps, with its increasing prevalence in both industry and academia, has evolved
into various DevOps variants (namely XOps) to address emerging technological and …

Adopting artificial intelligence technology for network operations in digital transformation

S Min, B Kim - Administrative Sciences, 2024 - mdpi.com
This study aims to define factors that affect Artificial Intelligence (AI) technology introduction
to network operations and analyze the relative importance of such factors. Based on this …

ADARMA auto-detection and auto-remediation of microservice anomalies by leveraging large language models

K Sarda, Z Namrud, R Rouf, H Ahuja… - Proceedings of the 33rd …, 2023 - dl.acm.org
In microservice architecture, anomalies can cause slow response times or poor user
experience if not detected early. Manual detection can be time-consuming and error-prone …

RAPID: Training-free Retrieval-based Log Anomaly Detection with PLM considering Token-level information

G No, Y Lee, H Kang, P Kang - arXiv preprint arXiv:2311.05160, 2023 - arxiv.org
As the IT industry advances, system log data becomes increasingly crucial. Many computer
systems rely on log texts for management due to restricted access to source code. The need …

Lightweight Multi-task Learning Method for System Log Anomaly Detection

TA Pham, JH Lee - IEEE Access, 2024 - ieeexplore.ieee.org
Log anomaly detection is a crucial task in monitoring IT systems along with metrics and
traces. An anomaly could be detected by either one of two types of logs: individual logs or …

ESRO: Experience Assisted Service Reliability against Outages

S Chakraborty, S Agarwal, S Garg… - 2023 38th IEEE/ACM …, 2023 - ieeexplore.ieee.org
Modern cloud services are prone to failures due to their complex architecture, making
diagnosis a critical process. Site Reliability Engineers (SREs) spend hours leveraging …