Cloud-native computing: A survey from the perspective of services

S Deng, H Zhao, B Huang, C Zhang… - Proceedings of the …, 2024 - ieeexplore.ieee.org
The development of cloud computing delivery models inspires the emergence of
cloud-native computing. Cloud-native computing, as the most influential development principle for …

Xpert: Empowering incident management with query recommendations via large language models

Y Jiang, C Zhang, S He, Z Yang, M Ma, S Qin… - Proceedings of the …, 2024 - dl.acm.org
Large-scale cloud systems play a pivotal role in modern IT infrastructure. However, incidents
occurring within these systems can lead to service disruptions and adversely affect user …

Assess and summarize: Improve outage understanding with large language models

P Jin, S Zhang, M Ma, H Li, Y Kang, L Li, Y Liu… - Proceedings of the 31st …, 2023 - dl.acm.org
Cloud systems have become increasingly popular in recent years due to their flexibility and
scalability. Each time cloud computing applications and services hosted on the cloud are …

How to fight production incidents? An empirical study on a large-scale cloud service

S Ghosh, M Shetty, C Bansal, S Nath - … of the 13th Symposium on Cloud …, 2022 - dl.acm.org
Production incidents in today's large-scale cloud services can be extremely expensive in
terms of customer impacts and engineering resources required to mitigate them. Despite …

A survey on intelligent management of alerts and incidents in IT services

Q Yu, N Zhao, M Li, Z Li, H Wang, W Zhang… - Journal of Network and …, 2024 - Elsevier
Modern service systems are constantly improving with the development of various IT
technologies, leading to a boost in system scales and complex dependencies among …

Detection is better than cure: A cloud incidents perspective

V Ganatra, A Parayil, S Ghosh, Y Kang, M Ma… - Proceedings of the 31st …, 2023 - dl.acm.org
Cloud providers use automated watchdogs or monitors to continuously observe service
availability and to proactively report incidents when system performance degrades. Improper …

Incident-aware duplicate ticket aggregation for cloud systems

J Liu, S He, Z Chen, L Li, Y Kang… - 2023 IEEE/ACM 45th …, 2023 - ieeexplore.ieee.org
In cloud systems, incidents are potential threats to customer satisfaction and business
revenue. When customers are affected by incidents, they often request customer support …

An intelligent framework for timely, accurate, and comprehensive cloud incident detection

Y Li, X Zhang, S He, Z Chen, Y Kang, J Liu… - ACM SIGOPS …, 2022 - dl.acm.org
Cloud incidents (service interruptions or performance degradation) dramatically degrade the
reliability of large-scale cloud systems, causing customer dissatisfaction and revenue loss …

Understanding and predicting incident mitigation time

W Wang, J Chen, L Yang, H Zhang, Z Wang - Information and Software …, 2023 - Elsevier
Context: Incident management plays a significant role in online service systems. Incidents
should be mitigated as soon as possible in order to achieve high service stability. However …

Prism: Revealing hidden functional clusters from massive instances in cloud systems

J Liu, Z Jiang, J Gu, J Huang, Z Chen… - 2023 38th IEEE/ACM …, 2023 - ieeexplore.ieee.org
Ensuring the reliability of cloud systems is critical for both cloud vendors and customers.
Cloud systems often rely on virtualization techniques to create instances of hardware …