Assess and summarize: Improve outage understanding with large language models
Cloud systems have become increasingly popular in recent years due to their flexibility and
scalability. Each time cloud computing applications and services hosted on the cloud are …
Incident-aware duplicate ticket aggregation for cloud systems
In cloud systems, incidents are potential threats to customer satisfaction and business
revenue. When customers are affected by incidents, they often request customer support …
A survey on intelligent management of alerts and incidents in IT services
Modern service systems are constantly improving with the development of various IT
technologies, leading to a boost in system scales and complex dependencies among …
Prism: Revealing hidden functional clusters from massive instances in cloud systems
Ensuring the reliability of cloud systems is critical for both cloud vendors and customers.
Cloud systems often rely on virtualization techniques to create instances of hardware …
Graph based incident extraction and diagnosis in large-scale online systems
With the ever increasing scale and complexity of online systems, incidents are gradually
becoming commonplace. Without appropriate handling, they can seriously harm the system …
Knowledge-aware Alert Aggregation in Large-scale Cloud Systems: a Hybrid Approach
Due to the scale and complexity of cloud systems, a system failure would trigger an "alert
storm", i.e., massive correlated alerts. Although these alerts can be traced back to a few root …
Graphweaver: Billion-scale cybersecurity incident correlation
In the dynamic landscape of large enterprise cybersecurity, accurately and efficiently
correlating billions of security alerts into comprehensive incidents is a substantial challenge …
A Miss Is as Good as A Mile: Metamorphic Testing for Deep Learning Operators
J Chen, C Jia, Y Yan, J Ge, H Zheng… - Proceedings of the ACM …, 2024 - dl.acm.org
Deep learning (DL) is a critical tool for real-world applications, and comprehensive testing of
DL models is vital to ensure their quality before deployment. However, recent studies have …
Performance Issue Identification in Cloud Systems with Relational-Temporal Anomaly Detection
Performance issues permeate large-scale cloud service systems, which can lead to huge
revenue losses. To ensure reliable performance, it's essential to accurately identify and …
Heterogeneous data-driven failure diagnosis for microservice-based industrial clouds towards consumer digital ecosystems
Consumer digital ecosystems include a large volume of different types of applications, and
those applications are usually deployed in industrial cloud computing systems. Currently …