Eadro: An end-to-end troubleshooting framework for microservices on multi-source data

C Lee, T Yang, Z Chen, Y Su… - 2023 IEEE/ACM 45th …, 2023 - ieeexplore.ieee.org
The complexity and dynamism of microservices pose significant challenges to system
reliability, and thereby, automated troubleshooting is crucial. Effective root cause localization …

Causal inference-based root cause analysis for online service systems with intervention recognition

M Li, Z Li, K Yin, X Nie, W Zhang, K Sui… - Proceedings of the 28th …, 2022 - dl.acm.org
Fault diagnosis is critical in many domains, as faults may lead to safety threats or economic
losses. In the field of online service systems, operators rely on enormous monitoring data to …

Interpretable failure localization for microservice systems based on graph autoencoder

Y Sun, Z Lin, B Shi, S Zhang, S Ma, P Jin… - ACM Transactions on …, 2024 - dl.acm.org
Accurate and efficient localization of root cause instances in large-scale microservice
systems is of paramount importance. Unfortunately, prevailing methods face several …

Robust failure diagnosis of microservice system through multimodal data

S Zhang, P Jin, Z Lin, Y Sun, B Zhang… - IEEE Transactions …, 2023 - ieeexplore.ieee.org
Automatic failure diagnosis is crucial for large microservice systems. Currently, most failure
diagnosis methods rely solely on single-modal data (i.e., using either metrics, logs, or traces) …

Incident-aware duplicate ticket aggregation for cloud systems

J Liu, S He, Z Chen, L Li, Y Kang… - 2023 IEEE/ACM 45th …, 2023 - ieeexplore.ieee.org
In cloud systems, incidents are potential threats to customer satisfaction and business
revenue. When customers are affected by incidents, they often request customer support …

Microfi: Non-intrusive and prioritized request-level fault injection for microservice applications

H Chen, P Chen, G Yu, X Li, Z He - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
Microservice is a widely-adopted architecture for constructing cloud-native applications. To
test application resiliency, chaos engineering is widely used to inject faults proactively in …

Maat: Performance metric anomaly anticipation for cloud services with conditional diffusion

C Lee, T Yang, Z Chen, Y Su… - 2023 38th IEEE/ACM …, 2023 - ieeexplore.ieee.org
Ensuring the reliability and user satisfaction of cloud services necessitates prompt anomaly
detection followed by diagnosis. Existing techniques for anomaly detection focus solely on …

Prism: Revealing hidden functional clusters from massive instances in cloud systems

J Liu, Z Jiang, J Gu, J Huang, Z Chen… - 2023 38th IEEE/ACM …, 2023 - ieeexplore.ieee.org
Ensuring the reliability of cloud systems is critical for both cloud vendors and customers.
Cloud systems often rely on virtualization techniques to create instances of hardware …

The Vision of Autonomic Computing: Can LLMs Make It a Reality?

Z Zhang, F Yang, X Qin, J Zhang, Q Lin… - arXiv preprint arXiv …, 2024 - arxiv.org
The Vision of Autonomic Computing (ACV), proposed over two decades ago, envisions
computing systems that self-manage akin to biological organisms, adapting seamlessly to …

MicroRes: Versatile Resilience Profiling in Microservices via Degradation Dissemination Indexing

T Yang, C Lee, J Shen, Y Su, C Feng, Y Yang… - Proceedings of the 33rd …, 2024 - dl.acm.org
Microservice resilience, the ability of microservices to recover from failures and continue
providing reliable and responsive services, is crucial for cloud vendors. However, the current …