Pre-trained KPI Anomaly Detection Model Through Disentangled Transformer

Z Yu, C Pei, X Wang, M Ma, C Bansal… - Proceedings of the 30th …, 2024 - dl.acm.org
In large-scale online service systems, numerous Key Performance Indicators (KPIs), such as
service response time and error rate, are gathered in a time-series format. KPI Anomaly …

End-to-End AutoML for Unsupervised Log Anomaly Detection

S Zhang, Y Ji, J Luan, X Nie, Z Chen, M Ma… - Proceedings of the 39th …, 2024 - dl.acm.org
As modern software systems evolve towards greater complexity, ensuring their reliable
operation has become a critical challenge. Log data analysis is vital in maintaining system …

Giving Every Modality a Voice in Microservice Failure Diagnosis via Multimodal Adaptive Optimization

L Tao, S Zhang, Z Jia, J Sun, M Ma, Z Li, Y Sun… - Proceedings of the 39th …, 2024 - dl.acm.org
Microservice systems are inherently complex and prone to failures, which can significantly
impact user experience. Existing diagnostic approaches based on single-modal data such …

Can We Trust Auto-Mitigation? Improving Cloud Failure Prediction with Uncertain Positive Learning

H Li, M Ma, Y Liu, P Zhao, S Li, Z Li… - 2024 IEEE 35th …, 2024 - ieeexplore.ieee.org
In the rapidly expanding domain of cloud computing, a variety of software services have
been deployed in the cloud. To ensure the reliability of cloud services, prior studies focus on …

Enabling Autonomic Microservice Management through Self-Learning Agents

F Yu, F Yang, X Qin, Z Zhang, J Zhang, Q Lin… - arXiv preprint arXiv …, 2025 - arxiv.org
The increasing complexity of modern software systems necessitates robust autonomic self-
management capabilities. While Large Language Models (LLMs) demonstrate potential in …

Large Language Models Can Provide Accurate and Interpretable Incident Triage

Z Wang, J Li, M Ma, Z Li, Y Kang… - 2024 IEEE 35th …, 2024 - ieeexplore.ieee.org
Large-scale cloud services frequently experience incidents that can have a significant
impact on their stability. Incident triage is a critical process that assigns incidents to …

A Survey on Large Language Models for Communication, Network, and Service Management: Application Insights, Challenges, and Future Directions

GO Boateng, H Sami, A Alagha, H Elmekki… - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid evolution of communication networks in recent decades has intensified the need
for advanced Network and Service Management (NSM) strategies to address the growing …

AIOpsLab: A Holistic Framework to Evaluate AI Agents for Enabling Autonomous Clouds

Y Chen, M Shetty, G Somashekar, M Ma… - arXiv preprint arXiv …, 2025 - arxiv.org
AI for IT Operations (AIOps) aims to automate complex operational tasks, such as fault
localization and root cause analysis, to reduce human workload and minimize customer …

Early Bird: Ensuring Reliability of Cloud Systems Through Early Failure Prediction

Y Liu, M Ma, P Zhao, T Li, B Qiao, S Li… - 2024 IEEE 35th …, 2024 - ieeexplore.ieee.org
As cloud service continues to dominate various sectors, the reliability of cloud infrastructures
becomes crucial. Traditional methods of failure prediction often fall short in providing …

Empowering AIOps: Leveraging Large Language Models for IT Operations Management

A Vitui, TH Chen - arXiv preprint arXiv:2501.12461, 2025 - arxiv.org
The integration of Artificial Intelligence (AI) into IT Operations Management (ITOM),
commonly referred to as AIOps, offers substantial potential for automating workflows …