Assess and summarize: Improve outage understanding with large language models

P **, S Zhang, M Ma, H Li, Y Kang, L Li, Y Liu… - Proceedings of the 31st …, 2023 - dl.acm.org
Cloud systems have become increasingly popular in recent years due to their flexibility and
scalability. Each time cloud computing applications and services hosted on the cloud are …

Incident-aware duplicate ticket aggregation for cloud systems

J Liu, S He, Z Chen, L Li, Y Kang… - 2023 IEEE/ACM 45th …, 2023 - ieeexplore.ieee.org
In cloud systems, incidents are potential threats to customer satisfaction and business
revenue. When customers are affected by incidents, they often request customer support …

A survey on intelligent management of alerts and incidents in IT services

Q Yu, N Zhao, M Li, Z Li, H Wang, W Zhang… - Journal of Network and …, 2024 - Elsevier
Modern service systems are constantly improving with the development of various IT
technologies, leading to a boost in system scales and complex dependencies among …

Prism: Revealing hidden functional clusters from massive instances in cloud systems

J Liu, Z Jiang, J Gu, J Huang, Z Chen… - 2023 38th IEEE/ACM …, 2023 - ieeexplore.ieee.org
Ensuring the reliability of cloud systems is critical for both cloud vendors and customers.
Cloud systems often rely on virtualization techniques to create instances of hardware …

Graph based incident extraction and diagnosis in large-scale online systems

Z He, P Chen, Y Luo, Q Yan, H Chen, G Yu… - Proceedings of the 37th …, 2022 - dl.acm.org
With the ever increasing scale and complexity of online systems, incidents are gradually
becoming commonplace. Without appropriate handling, they can seriously harm the system …

Knowledge-aware Alert Aggregation in Large-scale Cloud Systems: a Hybrid Approach

J Kuang, J Liu, J Huang, R Zhong, J Gu, L Yu… - Proceedings of the 46th …, 2024 - dl.acm.org
Due to the scale and complexity of cloud systems, a system failure would trigger an "alert
storm", i.e., massive correlated alerts. Although these alerts can be traced back to a few root …

Graphweaver: Billion-scale cybersecurity incident correlation

S Freitas, A Gharib - Proceedings of the 33rd ACM International …, 2024 - dl.acm.org
In the dynamic landscape of large enterprise cybersecurity, accurately and efficiently
correlating billions of security alerts into comprehensive incidents is a substantial challenge …

A Miss Is as Good as A Mile: Metamorphic Testing for Deep Learning Operators

J Chen, C Jia, Y Yan, J Ge, H Zheng… - Proceedings of the ACM …, 2024 - dl.acm.org
Deep learning (DL) is a critical tool for real-world applications, and comprehensive testing of
DL models is vital to ensure their quality before deployment. However, recent studies have …

Performance Issue Identification in Cloud Systems with Relational-Temporal Anomaly Detection

W Gu, J Liu, Z Chen, J Zhang, Y Su, J Gu… - arXiv preprint arXiv …, 2023 - arxiv.org
Performance issues permeate large-scale cloud service systems, which can lead to huge
revenue losses. To ensure reliable performance, it's essential to accurately identify and …

Heterogeneous data-driven failure diagnosis for microservice-based industrial clouds towards consumer digital ecosystems

Y Xu, Z Qiu, H Gao, X Zhao, L Wang… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Consumer digital ecosystems include a large volume of different types of applications, and
those applications are usually deployed in industrial cloud computing systems. Currently …