Eadro: An end-to-end troubleshooting framework for microservices on multi-source data

C Lee, T Yang, Z Chen, Y Su… - 2023 IEEE/ACM 45th …, 2023 - ieeexplore.ieee.org
The complexity and dynamism of microservices pose significant challenges to system
reliability, and thereby, automated troubleshooting is crucial. Effective root cause localization …

Causal inference-based root cause analysis for online service systems with intervention recognition

M Li, Z Li, K Yin, X Nie, W Zhang, K Sui… - Proceedings of the 28th …, 2022 - dl.acm.org
Fault diagnosis is critical in many domains, as faults may lead to safety threats or economic
losses. In the field of online service systems, operators rely on enormous monitoring data to …

Robust failure diagnosis of microservice system through multimodal data

S Zhang, P Jin, Z Lin, Y Sun, B Zhang… - IEEE Transactions …, 2023 - ieeexplore.ieee.org
Automatic failure diagnosis is crucial for large microservice systems. Currently, most failure
diagnosis methods rely solely on single-modal data (i.e., using either metrics, logs, or traces) …

Interpretable Failure Localization for Microservice Systems Based on Graph Autoencoder

Y Sun, Z Lin, B Shi, S Zhang, S Ma, P Jin… - ACM Transactions on …, 2024 - dl.acm.org
Accurate and efficient localization of root cause instances in large-scale microservice
systems is of paramount importance. Unfortunately, prevailing methods face several …

Incident-aware duplicate ticket aggregation for cloud systems

J Liu, S He, Z Chen, L Li, Y Kang… - 2023 IEEE/ACM 45th …, 2023 - ieeexplore.ieee.org
In cloud systems, incidents are potential threats to customer satisfaction and business
revenue. When customers are affected by incidents, they often request customer support …

Maat: Performance metric anomaly anticipation for cloud services with conditional diffusion

C Lee, T Yang, Z Chen, Y Su… - 2023 38th IEEE/ACM …, 2023 - ieeexplore.ieee.org
Ensuring the reliability and user satisfaction of cloud services necessitates prompt anomaly
detection followed by diagnosis. Existing techniques for anomaly detection focus solely on …

Prism: Revealing hidden functional clusters from massive instances in cloud systems

J Liu, Z Jiang, J Gu, J Huang, Z Chen… - 2023 38th IEEE/ACM …, 2023 - ieeexplore.ieee.org
Ensuring the reliability of cloud systems is critical for both cloud vendors and customers.
Cloud systems often rely on virtualization techniques to create instances of hardware …

The Vision of Autonomic Computing: Can LLMs Make It a Reality?

Z Zhang, F Yang, X Qin, J Zhang, Q Lin… - arXiv preprint arXiv …, 2024 - arxiv.org
The Vision of Autonomic Computing (ACV), proposed over two decades ago, envisions
computing systems that self-manage akin to biological organisms, adapting seamlessly to …

MicroRes: Versatile Resilience Profiling in Microservices via Degradation Dissemination Indexing

T Yang, C Lee, J Shen, Y Su, C Feng, Y Yang… - Proceedings of the 33rd …, 2024 - dl.acm.org
Microservice resilience, the ability of microservices to recover from failures and continue
providing reliable and responsive services, is crucial for cloud vendors. However, the current …

Demystifying and Extracting Fault-indicating Information from Logs for Failure Diagnosis

J Huang, Z Jiang, J Liu, Y Huo, J Gu… - 2024 IEEE 35th …, 2024 - ieeexplore.ieee.org
Logs are imperative in the maintenance of online service systems, which often encompass
important information for effective failure mitigation. While existing anomaly detection …