A joint study of the challenges, opportunities, and roadmap of MLOps and AIOps: A systematic survey

J Diaz-De-Arcaya, AI Torre-Bastida, G Zárate… - ACM Computing …, 2023 - dl.acm.org
Data science projects represent a greater challenge than software engineering for
organizations pursuing their adoption. The diverse stakeholders involved emphasize the …

RCAgent: Cloud root cause analysis by autonomous agents with tool-augmented large language models

Z Wang, Z Liu, Y Zhang, A Zhong, J Wang… - Proceedings of the 33rd …, 2024 - dl.acm.org
Large language model (LLM) applications in cloud root cause analysis (RCA) have been
actively explored recently. However, current methods are still reliant on manual workflow …

A survey of AIOps for failure management in the era of large language models

L Zhang, T Jia, M Jia, Y Wu, A Liu, Y Yang, Z Wu… - arXiv preprint arXiv …, 2024 - arxiv.org
As software systems grow increasingly intricate, Artificial Intelligence for IT Operations
(AIOps) methods have been widely used in software system failure management to ensure …

Adopting artificial intelligence technology for network operations in digital transformation

S Min, B Kim - Administrative Sciences, 2024 - mdpi.com
This study aims to define factors that affect Artificial Intelligence (AI) technology introduction
to network operations and analyze the relative importance of such factors. Based on this …

ADARMA auto-detection and auto-remediation of microservice anomalies by leveraging large language models

K Sarda, Z Namrud, R Rouf, H Ahuja… - Proceedings of the 33rd …, 2023 - dl.acm.org
In microservice architecture, anomalies can cause slow response times or poor user
experience if not detected early. Manual detection can be time-consuming and error-prone …

Training-free retrieval-based log anomaly detection with pre-trained language model considering token-level information

G No, Y Lee, H Kang, P Kang - Engineering Applications of Artificial …, 2024 - Elsevier
As the information technology industry advances, the demand for log anomaly detection,
based solely on printed log text, is growing. However, identifying anomalies in rapidly …

Efficient resource utilization in IoT and cloud computing

VK Prasad, D Dansana, MD Bhavsar, B Acharya… - Information, 2023 - mdpi.com
With the proliferation of IoT devices, there has been exponential growth in data generation,
placing substantial demands on both cloud computing (CC) and internet infrastructure. CC …

KGroot: A knowledge graph-enhanced method for root cause analysis

T Wang, G Qi, T Wu - Expert Systems with Applications, 2024 - Elsevier
Fault localization in online microservices is a challenging task due to the vast amount of
monitoring data, diversity of types and events, and complex interdependencies among …

PyRCA: A library for metric-based root cause analysis

C Liu, W Yang, H Mittal, M Singh, D Sahoo… - arXiv preprint arXiv …, 2023 - arxiv.org
We introduce PyRCA, an open-source Python machine learning library of Root Cause
Analysis (RCA) for Artificial Intelligence for IT Operations (AIOps). It provides a holistic …

Navigating the DevOps landscape

X Zhang, P Zhao, J Jaskolka - Journal of Systems and Software, 2025 - Elsevier
Context: DevOps, with its increasing prevalence in both industry and academia, has evolved
into various DevOps variants (namely XOps) to address emerging technological and …