Datacenter traffic control: Understanding techniques and tradeoffs

M Noormohammadpour… - … Surveys & Tutorials, 2017 - ieeexplore.ieee.org
Datacenters provide cost-effective and flexible access to scalable compute and storage
resources necessary for today's cloud computing needs. A typical datacenter is made up of …

Congestion control in named data networking – a survey

Y Ren, J Li, S Shi, L Li, G Wang, B Zhang - Computer Communications, 2016 - Elsevier
As a typical Information Centric Networking, Named Data Networking (NDN) has
attracted wide research attentions in recent years. NDN evolves today's host-centric network …

A unified architecture for accelerating distributed DNN training in heterogeneous GPU/CPU clusters

Y Jiang, Y Zhu, C Lan, B Yi, Y Cui, C Guo - 14th USENIX Symposium on …, 2020 - usenix.org
Data center clusters that run DNN training jobs are inherently heterogeneous. They have
GPUs and CPUs for computation and network bandwidth for distributed training. However …

Swift: Delay is simple and effective for congestion control in the datacenter

G Kumar, N Dukkipati, K Jang, HMG Wassel… - Proceedings of the …, 2020 - dl.acm.org
We report on experiences with Swift congestion control in Google datacenters. Swift targets
an end-to-end delay by using AIMD control, with pacing under extreme congestion. With …

HPCC: High precision congestion control

Y Li, R Miao, HH Liu, Y Zhuang, F Feng… - Proceedings of the …, 2019 - dl.acm.org
Congestion control (CC) is the key to achieving ultra-low latency, high bandwidth and
network stability in high-speed networks. From years of experience operating large-scale …

Azure accelerated networking: SmartNICs in the public cloud

D Firestone, A Putnam, S Mundkur, D Chiou… - … USENIX Symposium on …, 2018 - usenix.org
Modern cloud architectures rely on each server running its own networking stack to
implement policies such as tunneling for virtual networks, security, and load balancing …

A cloud-scale acceleration architecture

AM Caulfield, ES Chung, A Putnam… - 2016 49th Annual …, 2016 - ieeexplore.ieee.org
Hyperscale datacenter providers have struggled to balance the growing need for
specialized hardware (efficiency) with the economic benefits of homogeneity …

Homa: A receiver-driven low-latency transport protocol using network priorities

B Montazeri, Y Li, M Alizadeh… - Proceedings of the 2018 …, 2018 - dl.acm.org
Homa is a new transport protocol for datacenter networks. It provides exceptionally low
latency, especially for workloads with a high volume of very short messages, and it also …

An exhaustive survey on P4 programmable data plane switches: Taxonomy, applications, challenges, and future trends

EF Kfoury, J Crichigno, E Bou-Harb - IEEE Access, 2021 - ieeexplore.ieee.org
Traditionally, the data plane has been designed with fixed functions to forward packets using
a small set of protocols. This closed-design paradigm has limited the capability of the …

MegaScale: Scaling large language model training to more than 10,000 GPUs

Z Jiang, H Lin, Y Zhong, Q Huang, Y Chen… - … USENIX Symposium on …, 2024 - usenix.org
We present the design, implementation and engineering experience in building and
deploying MegaScale, a production system for training large language models (LLMs) at the …