ImDiffusion: Imputed diffusion models for multivariate time series anomaly detection

Y Chen, C Zhang, M Ma, Y Liu, R Ding, B Li… - arXiv preprint arXiv …, 2023 - arxiv.org
Anomaly detection in multivariate time series data is of paramount importance for ensuring
the efficient operation of large-scale systems across diverse domains. However, accurately …

Xpert: Empowering incident management with query recommendations via large language models

Y Jiang, C Zhang, S He, Z Yang, M Ma, S Qin… - Proceedings of the …, 2024 - dl.acm.org
Large-scale cloud systems play a pivotal role in modern IT infrastructure. However, incidents
occurring within these systems can lead to service disruptions and adversely affect user …

MonitorAssistant: Simplifying cloud service monitoring via large language models

Z Yu, M Ma, C Zhang, S Qin, Y Kang, C Bansal… - … Proceedings of the …, 2024 - dl.acm.org
In large-scale cloud service systems, monitoring metric data and conducting anomaly
detection is an important way to maintain reliability and stability. However, great disparity …

Assess and summarize: Improve outage understanding with large language models

P Jin, S Zhang, M Ma, H Li, Y Kang, L Li, Y Liu… - Proceedings of the 31st …, 2023 - dl.acm.org
Cloud systems have become increasingly popular in recent years due to their flexibility and
scalability. Each time cloud computing applications and services hosted on the cloud are …

Empowering practical root cause analysis by large language models for cloud incidents

Y Chen, H Xie, M Ma, Y Kang, X Gao… - arXiv preprint arXiv …, 2023 - yinfangchen.github.io
Ensuring the reliability and availability of cloud services necessitates efficient root cause
analysis (RCA) for cloud incidents. Traditional RCA methods, which rely on manual …

Automatic root cause analysis via large language models for cloud incidents

Y Chen, H Xie, M Ma, Y Kang, X Gao, L Shi… - Proceedings of the …, 2024 - dl.acm.org
Ensuring the reliability and availability of cloud services necessitates efficient root cause
analysis (RCA) for cloud incidents. Traditional RCA methods, which rely on manual …

Building AI Agents for Autonomous Clouds: Challenges and Design Principles

M Shetty, Y Chen, G Somashekar, M Ma… - Proceedings of the …, 2024 - dl.acm.org
The rapid growth in the use of Large Language Models (LLMs) and AI Agents as part of
software development and deployment is revolutionizing the information technology …

Large Language Models Can Provide Accurate and Interpretable Incident Triage

Z Wang, J Li, M Ma, Z Li, Y Kang… - 2024 IEEE 35th …, 2024 - ieeexplore.ieee.org
Large-scale cloud services frequently experience incidents that can have a significant
impact on their stability. Incident triage is a critical process that assigns incidents to …

Augmenting Automatic Root-Cause Identification with Incident Alerts Using LLM

K Sarda, Z Namrud, I Watts, L Shwartz… - … in Software and …, 2024 - ieeexplore.ieee.org
Ensuring the reliability and availability of cloud services relies heavily on efficient root cause
analysis (RCA) for cloud incidents. Traditionally, RCA involved labor-intensive manual …

Variational Autoencoder and Graph Attention Root Cause Localization Model Based on Log Data and Graph Structure

J Ding, Y Yan, J Wang, T Chen - International Conference on Intelligent …, 2024 - Springer
When conducting root cause localization, converting data into graph structures for feature
extraction can represent complex dependency relationships among data more …