XPUTimer: Anomaly Diagnostics for Divergent LLM Training in GPU Clusters of Thousand-Plus Scale
The rapid proliferation of large language models has driven the need for efficient GPU
training clusters. However, ensuring high-performance training in these clusters is …
EdgeCross: Cloud Scale Traffic Management at Peering Edges
X Wang, P Mi, Y Zhu, B An, Y Wang, L Wang… - Proceedings of the …, 2024 - dl.acm.org
Cloud providers deployed dozens of PoPs and data centers globally to serve billions of geo-
distributed users. The traffic management at peering edges has become a key capability of …
LLM-Sketch: Enhancing Network Sketches with LLM
Y Li, Z Xu, Z Lv, Y Hu, Y Cui, T Yang - arXiv preprint arXiv:2502.07495, 2025 - arxiv.org
Network stream mining is fundamental to many network operations. Sketches, as compact
data structures that offer low memory overhead with bounded accuracy, have emerged as a …