A survey on intelligent management of alerts and incidents in IT services

Q Yu, N Zhao, M Li, Z Li, H Wang, W Zhang… - Journal of Network and …, 2024 - Elsevier
Modern service systems are constantly improving with the development of various IT
technologies, leading to a boost in system scales and complex dependencies among …

Assess and summarize: Improve outage understanding with large language models

P Jin, S Zhang, M Ma, H Li, Y Kang, L Li, Y Liu… - Proceedings of the 31st …, 2023 - dl.acm.org
Cloud systems have become increasingly popular in recent years due to their flexibility and
scalability. Each time cloud computing applications and services hosted on the cloud are …

Knowledge-aware alert aggregation in large-scale cloud systems: a hybrid approach

J Kuang, J Liu, J Huang, R Zhong, J Gu, L Yu… - Proceedings of the 46th …, 2024 - dl.acm.org
Due to the scale and complexity of cloud systems, a system failure would trigger an "alert
storm", i.e., massive correlated alerts. Although these alerts can be traced back to a few root …

A Miss Is as Good as A Mile: Metamorphic Testing for Deep Learning Operators

J Chen, C Jia, Y Yan, J Ge, H Zheng… - Proceedings of the ACM …, 2024 - dl.acm.org
Deep learning (DL) is a critical tool for real-world applications, and comprehensive testing of
DL models is vital to ensure their quality before deployment. However, recent studies have …

FaultProfIT: Hierarchical fault profiling of incident tickets in large-scale cloud systems

J Huang, J Liu, Z Chen, Z Jiang, Y Li, J Gu… - Proceedings of the 46th …, 2024 - dl.acm.org
Postmortem analysis is essential in the management of incidents within cloud systems,
which provides valuable insights to improve system's reliability and robustness. At CloudA …

Graph based incident extraction and diagnosis in large-scale online systems

Z He, P Chen, Y Luo, Q Yan, H Chen, G Yu… - Proceedings of the 37th …, 2022 - dl.acm.org
With the ever-increasing scale and complexity of online systems, incidents are gradually
becoming commonplace. Without appropriate handling, they can seriously harm the system …

TraceMesh: Scalable and streaming sampling for distributed traces

Z Chen, Z Jiang, Y Su, MR Lyu… - 2024 IEEE 17th …, 2024 - ieeexplore.ieee.org
Distributed tracing serves as a fundamental element in the monitoring of cloud-based and
datacenter systems. It provides visibility into the full life cycle of a request or operation across …

Heterogeneous data-driven failure diagnosis for microservice-based industrial clouds toward consumer digital ecosystems

Y Xu, Z Qiu, H Gao, X Zhao, L Wang… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Consumer digital ecosystems include a large volume of different types of applications, and
those applications are usually deployed in industrial cloud computing systems. Currently …

Incident-aware duplicate ticket aggregation for cloud systems

J Liu, S He, Z Chen, L Li, Y Kang… - 2023 IEEE/ACM 45th …, 2023 - ieeexplore.ieee.org
In cloud systems, incidents are potential threats to customer satisfaction and business
revenue. When customers are affected by incidents, they often request customer support …

Prism: Revealing hidden functional clusters from massive instances in cloud systems

J Liu, Z Jiang, J Gu, J Huang, Z Chen… - 2023 38th IEEE/ACM …, 2023 - ieeexplore.ieee.org
Ensuring the reliability of cloud systems is critical for both cloud vendors and customers.
Cloud systems often rely on virtualization techniques to create instances of hardware …