Datacenter traffic control: Understanding techniques and tradeoffs

M Noormohammadpour… - … Surveys & Tutorials, 2017 - ieeexplore.ieee.org
Datacenters provide cost-effective and flexible access to scalable compute and storage
resources necessary for today's cloud computing needs. A typical datacenter is made up of …

MegaScale: Scaling large language model training to more than 10,000 GPUs

Z Jiang, H Lin, Y Zhong, Q Huang, Y Chen… - … USENIX Symposium on …, 2024 - usenix.org
We present the design, implementation and engineering experience in building and
deploying MegaScale, a production system for training large language models (LLMs) at the …

Congestion control in named data networking – a survey

Y Ren, J Li, S Shi, L Li, G Wang, B Zhang - Computer Communications, 2016 - Elsevier
As a typical Information Centric Networking, Named Data Networking (NDN) has
attracted wide research attentions in recent years. NDN evolves today's host-centric network …

HPCC: High precision congestion control

Y Li, R Miao, HH Liu, Y Zhuang, F Feng… - Proceedings of the …, 2019 - dl.acm.org
Congestion control (CC) is the key to achieving ultra-low latency, high bandwidth and
network stability in high-speed networks. From years of experience operating large-scale …

A unified architecture for accelerating distributed DNN training in heterogeneous GPU/CPU clusters

Y Jiang, Y Zhu, C Lan, B Yi, Y Cui, C Guo - 14th USENIX Symposium on …, 2020 - usenix.org
Data center clusters that run DNN training jobs are inherently heterogeneous. They have
GPUs and CPUs for computation and network bandwidth for distributed training. However …

Swift: Delay is simple and effective for congestion control in the datacenter

G Kumar, N Dukkipati, K Jang, HMG Wassel… - Proceedings of the …, 2020 - dl.acm.org
We report on experiences with Swift congestion control in Google datacenters. Swift targets
an end-to-end delay by using AIMD control, with pacing under extreme congestion. With …

Azure accelerated networking: SmartNICs in the public cloud

D Firestone, A Putnam, S Mundkur, D Chiou… - … USENIX Symposium on …, 2018 - usenix.org
Modern cloud architectures rely on each server running its own networking stack to
implement policies such as tunneling for virtual networks, security, and load balancing …

Homa: A receiver-driven low-latency transport protocol using network priorities

B Montazeri, Y Li, M Alizadeh… - Proceedings of the 2018 …, 2018 - dl.acm.org
Homa is a new transport protocol for datacenter networks. It provides exceptionally low
latency, especially for workloads with a high volume of very short messages, and it also …

Understanding host network stack overheads

Q Cai, S Chaudhary, M Vuppalapati, J Hwang… - Proceedings of the …, 2021 - dl.acm.org
Traditional end-host network stacks are struggling to keep up with rapidly increasing
datacenter access link bandwidths due to their unsustainable CPU overheads. Motivated by …

A cloud-scale acceleration architecture

AM Caulfield, ES Chung, A Putnam… - 2016 49th Annual …, 2016 - ieeexplore.ieee.org
Hyperscale datacenter providers have struggled to balance the growing need for
specialized hardware (efficiency) with the economic benefits of homogeneity …