Automatic root cause analysis via large language models for cloud incidents

Y Chen, H Xie, M Ma, Y Kang, X Gao, L Shi… - Proceedings of the …, 2024 - dl.acm.org
Ensuring the reliability and availability of cloud services necessitates efficient root cause
analysis (RCA) for cloud incidents. Traditional RCA methods, which rely on manual …

AuTO: Scaling deep reinforcement learning for datacenter-scale automatic traffic optimization

L Chen, J Lingys, K Chen, F Liu - Proceedings of the 2018 conference of …, 2018 - dl.acm.org
Traffic optimizations (TO, e.g. flow scheduling, load balancing) in datacenters are difficult
online decision-making problems. Previously, they are done with heuristics relying on …

Language-directed hardware design for network performance monitoring

S Narayana, A Sivaraman, V Nathan, P Goyal… - Proceedings of the …, 2017 - dl.acm.org
Network performance monitoring today is restricted by existing switch support for
measurement, forcing operators to rely heavily on endpoints with poor visibility into the …

A survey on big data for network traffic monitoring and analysis

A D'Alconzo, I Drago, A Morichetta… - … on Network and …, 2019 - ieeexplore.ieee.org
Network Traffic Monitoring and Analysis (NTMA) represents a key component for network
management, especially to guarantee the correct operation of large-scale networks such as …

Flow event telemetry on programmable data plane

Y Zhou, C Sun, HH Liu, R Miao, S Bai, B Li… - Proceedings of the …, 2020 - dl.acm.org
Network performance anomalies (NPAs), e.g. long-tailed latency, bandwidth decline, etc., are
increasingly crucial to cloud providers as applications are getting more sensitive to …

Next-generation data center network enabled by machine learning: Review, challenges, and opportunities

H Dong, A Munir, H Tout, Y Ganjali - IEEE Access, 2021 - ieeexplore.ieee.org
Data center network (DCN) is the backbone of many emerging applications from smart
connected homes to smart traffic control and is continuously evolving to meet the diverse …

Diagnosing root causes of intermittent slow queries in cloud databases

M Ma, Z Yin, S Zhang, S Wang, C Zheng… - Proceedings of the …, 2020 - dl.acm.org
With the growing market of cloud databases, careful detection and elimination of slow
queries are of great importance to service stability. Previous studies focus on optimizing the …

From Luna to Solar: the evolutions of the compute-to-storage networks in Alibaba Cloud

R Miao, L Zhu, S Ma, K Qian, S Zhuang, B Li… - Proceedings of the …, 2022 - dl.acm.org
This paper presents the two generations of storage network stacks that reduced the average
I/O latency of Alibaba Cloud's EBS service by 72% in the last five years: Luna, a user-space …

NetBouncer: Active device and link failure localization in data center networks

C Tan, Z Jin, C Guo, T Zhang, H Wu, K Deng… - … USENIX Symposium on …, 2019 - usenix.org
The availability of data center services is jeopardized by various network incidents. One of
the biggest challenges for network incident handling is to accurately localize the failures …

pForest: In-network inference with random forests

C Busse-Grawitz, R Meier, A Dietmüller… - arXiv preprint arXiv …, 2019 - arxiv.org
When classifying network traffic, a key challenge is deciding when to perform the
classification, i.e., after how many packets. Too early, and the decision basis is too thin to …