Eadro: An end-to-end troubleshooting framework for microservices on multi-source data

C Lee, T Yang, Z Chen, Y Su… - 2023 IEEE/ACM 45th …, 2023 - ieeexplore.ieee.org
The complexity and dynamism of microservices pose significant challenges to system
reliability, and thereby, automated troubleshooting is crucial. Effective root cause localization …

A survey of graph-based deep learning for anomaly detection in distributed systems

AD Pazho, GA Noghre, AA Purkayastha… - … on Knowledge and …, 2023 - ieeexplore.ieee.org
Anomaly detection is a crucial task in complex distributed systems. A thorough
understanding of the requirements and challenges of anomaly detection is pivotal to the …

Twin graph-based anomaly detection via attentive multi-modal learning for microservice system

J Huang, Y Yang, H Yu, J Li… - 2023 38th IEEE/ACM …, 2023 - ieeexplore.ieee.org
Microservice architecture has sprung up over recent years for managing enterprise
applications, due to its ability to independently deploy and scale services. Despite its …

Knowledge-aware alert aggregation in large-scale cloud systems: a hybrid approach

J Kuang, J Liu, J Huang, R Zhong, J Gu, L Yu… - Proceedings of the 46th …, 2024 - dl.acm.org
Due to the scale and complexity of cloud systems, a system failure would trigger an "alert
storm", i.e., massive correlated alerts. Although these alerts can be traced back to a few root …

On the influence of data resampling for deep learning-based log anomaly detection: Insights and recommendations

X Ma, H Zou, P He, J Keung, Y Li… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
Numerous Deep Learning (DL)-based approaches have gained attention in software Log
Anomaly Detection (LAD), yet class imbalance in training data remains a challenge, with …

ART: A Unified Unsupervised Framework for Incident Management in Microservice Systems

Y Sun, B Shi, M Mao, M Ma, S Xia, S Zhang… - Proceedings of the 39th …, 2024 - dl.acm.org
Automated incident management is critical for large-scale microservice systems, including
tasks such as anomaly detection (AD), failure triage (FT), and root cause localization (RCL) …

Instantops: A joint approach to system failure prediction and root cause identification in microserivces cloud-native applications

R Rouf, M Rasolroveicy, M Litoiu, S Nagar… - Proceedings of the 15th …, 2024 - dl.acm.org
As microservice and cloud computing operations increasingly adopt automation, the
importance of models for fostering resilient and efficient adaptive architectures becomes …

Maat: Performance metric anomaly anticipation for cloud services with conditional diffusion

C Lee, T Yang, Z Chen, Y Su… - 2023 38th IEEE/ACM …, 2023 - ieeexplore.ieee.org
Ensuring the reliability and user satisfaction of cloud services necessitates prompt anomaly
detection followed by diagnosis. Existing techniques for anomaly detection focus solely on …

Uac-ad: Unsupervised adversarial contrastive learning for anomaly detection on multi-modal data in microservice systems

H Liu, X Huang, M Jia, T Jia, J Han… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
To ensure the stability and reliability of microservice systems, timely and accurate anomaly
detection is of utmost importance. Recently, considering the lack of labels in real-world …

No More Data Silos: Unified Microservice Failure Diagnosis with Temporal Knowledge Graph

S Zhang, Y Zhao, S Xia, S Wei, Y Sun… - IEEE Transactions …, 2024 - ieeexplore.ieee.org
Microservices improve the scalability and flexibility of monolithic architectures to
accommodate the evolution of software systems, but the complexity and dynamics of …