A survey on intelligent management of alerts and incidents in IT services

Q Yu, N Zhao, M Li, Z Li, H Wang, W Zhang… - Journal of Network and …, 2024 - Elsevier
Modern service systems are constantly improving with the development of various IT
technologies, leading to a boost in system scales and complex dependencies among …

Assess and summarize: Improve outage understanding with large language models

P Jin, S Zhang, M Ma, H Li, Y Kang, L Li, Y Liu… - Proceedings of the 31st …, 2023 - dl.acm.org
Cloud systems have become increasingly popular in recent years due to their flexibility and
scalability. Each time cloud computing applications and services hosted on the cloud are …

Robust failure diagnosis of microservice system through multimodal data

S Zhang, P Jin, Z Lin, Y Sun, B Zhang… - IEEE Transactions …, 2023 - ieeexplore.ieee.org
Automatic failure diagnosis is crucial for large microservice systems. Currently, most failure
diagnosis methods rely solely on single-modal data (i.e., using either metrics, logs, or traces) …

Interpretable failure localization for microservice systems based on graph autoencoder

Y Sun, Z Lin, B Shi, S Zhang, S Ma, P Jin… - ACM Transactions on …, 2025 - dl.acm.org
Accurate and efficient localization of root cause instances in large-scale microservice
systems is of paramount importance. Unfortunately, prevailing methods face several …

Knowledge-aware alert aggregation in large-scale cloud systems: a hybrid approach

J Kuang, J Liu, J Huang, R Zhong, J Gu, L Yu… - Proceedings of the 46th …, 2024 - dl.acm.org
Due to the scale and complexity of cloud systems, a system failure would trigger an "alert
storm", i.e., massive correlated alerts. Although these alerts can be traced back to a few root …

APGNN: Alarm Propagation Graph Neural Network for fault detection and alarm root cause analysis

W Jiang, Y Bai - Computer Networks, 2023 - Elsevier
Telecommunication network plays an important role in our daily life. Fault detection and
alarm root cause analysis are the keys to ensure the normal operation of the network. To …

An intelligent framework for timely, accurate, and comprehensive cloud incident detection

Y Li, X Zhang, S He, Z Chen, Y Kang, J Liu… - ACM SIGOPS …, 2022 - dl.acm.org
Cloud incidents (service interruptions or performance degradation) dramatically degrade the
reliability of large-scale cloud systems, causing customer dissatisfaction and revenue loss …

Graph based incident extraction and diagnosis in large-scale online systems

Z He, P Chen, Y Luo, Q Yan, H Chen, G Yu… - Proceedings of the 37th …, 2022 - dl.acm.org
With the ever increasing scale and complexity of online systems, incidents are gradually
becoming commonplace. Without appropriate handling, they can seriously harm the system …

Incident-aware duplicate ticket aggregation for cloud systems

J Liu, S He, Z Chen, L Li, Y Kang… - 2023 IEEE/ACM 45th …, 2023 - ieeexplore.ieee.org
In cloud systems, incidents are potential threats to customer satisfaction and business
revenue. When customers are affected by incidents, they often request customer support …

No More Data Silos: Unified Microservice Failure Diagnosis with Temporal Knowledge Graph

S Zhang, Y Zhao, S Xia, S Wei, Y Sun… - IEEE Transactions …, 2024 - ieeexplore.ieee.org
Microservices improve the scalability and flexibility of monolithic architectures to
accommodate the evolution of software systems, but the complexity and dynamics of …