Deep learning workload scheduling in GPU datacenters: A survey

Z Ye, W Gao, Q Hu, P Sun, X Wang, Y Luo… - ACM Computing …, 2024 - dl.acm.org
Deep learning (DL) has demonstrated remarkable success in a wide variety of fields. The
development of a DL model is a time-consuming and resource-intensive procedure. Hence …

THC: Accelerating distributed deep learning using tensor homomorphic compression

M Li, RB Basat, S Vargaftik, CL Lao, K Xu… - … USENIX Symposium on …, 2024 - usenix.org
Deep neural networks (DNNs) are the de facto standard for essential use cases, such as
image classification, computer vision, and natural language processing. As DNNs and …
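The snippet only hints at the mechanism, so the following is a minimal, generic gradient-quantization sketch in NumPy. It is not THC's homomorphic compression scheme; the function names and the 8-bit uniform quantizer are assumptions chosen purely to illustrate why compressing tensors before aggregation cuts the bytes each worker must send.

```python
# Generic illustration of gradient quantization before aggregation.
# NOT the THC scheme from the paper above; a hypothetical 8-bit uniform
# quantizer that shows how compression shrinks per-worker traffic.
import numpy as np

def quantize_uniform(grad: np.ndarray, num_bits: int = 8):
    """Map float32 gradients onto num_bits unsigned integers plus a scale/offset."""
    lo, hi = grad.min(), grad.max()
    levels = (1 << num_bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    q = np.round((grad - lo) / scale).astype(np.uint8)
    return q, float(scale), float(lo)

def dequantize_uniform(q: np.ndarray, scale: float, lo: float) -> np.ndarray:
    """Recover an approximate float32 gradient from the quantized form."""
    return q.astype(np.float32) * scale + lo

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    grad = rng.standard_normal(1_000_000).astype(np.float32)  # one worker's gradient
    q, scale, lo = quantize_uniform(grad)
    restored = dequantize_uniform(q, scale, lo)
    print("bytes before:", grad.nbytes, "bytes after:", q.nbytes)  # 4x fewer bytes
    print("max abs error:", float(np.abs(grad - restored).max()))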

Resource allocation and workload scheduling for large-scale distributed deep learning: A survey

F Liang, Z Zhang, H Lu, C Li, V Leung, Y Guo… - arXiv preprint arXiv …, 2024 - arxiv.org
With rapidly increasing distributed deep learning workloads in large-scale data centers,
efficient distributed deep learning framework strategies for resource allocation and workload …

Crux: GPU-efficient communication scheduling for deep learning training

J Cao, Y Guan, K Qian, J Gao, W Xiao, J Dong… - Proceedings of the …, 2024 - dl.acm.org
Deep learning training (DLT), e.g., large language model (LLM) training, has become one of
the most important services in multitenant cloud computing. By deeply studying in …

MLTCP: A Distributed Technique to Approximate Centralized Flow Scheduling For Machine Learning

S Rajasekaran, S Narang, AA Zabreyko… - Proceedings of the 23rd …, 2024 - dl.acm.org
This paper argues that congestion control protocols in machine learning datacenters sit at a
sweet spot between centralized and distributed flow scheduling solutions. We present …

MLTCP: Congestion control for DNN training

S Rajasekaran, S Narang, AA Zabreyko… - arXiv preprint arXiv …, 2024 - arxiv.org
We present MLTCP, a technique to augment today's congestion control algorithms to
accelerate DNN training jobs in shared GPU clusters. MLTCP enables the communication …
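Since the snippet stops before describing the mechanism, here is a small, hypothetical sketch of the general idea of feeding a training-iteration signal into an AIMD-style congestion window update. The progress-weighted increase and all names are assumptions for illustration only, not the algorithm the MLTCP papers propose.

```python
# Hypothetical illustration (not the MLTCP algorithm): an AIMD-style window
# update whose additive increase is weighted by how far a training job has
# progressed through its current communication phase, showing the kind of
# per-iteration signal a DNN-aware congestion control tweak could consume.
def aimd_step(cwnd: float, bytes_sent_this_iter: int, iter_bytes_total: int,
              loss: bool, base_increase: float = 1.0) -> float:
    """Return the next congestion window (in segments)."""
    if loss:
        return max(cwnd / 2.0, 1.0)                      # multiplicative decrease
    progress = bytes_sent_this_iter / max(iter_bytes_total, 1)
    return cwnd + base_increase * (0.5 + progress)       # progress-weighted increase

if __name__ == "__main__":
    cwnd = 10.0
    for step, (sent, loss) in enumerate([(0, False), (4_000, False),
                                         (8_000, True), (2_000, False)]):
        cwnd = aimd_step(cwnd, sent, iter_bytes_total=10_000, loss=loss)
        print(f"step {step}: cwnd = {cwnd:.2f}")
```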

PSscheduler: A parameter synchronization scheduling algorithm for distributed machine learning in reconfigurable optical networks

L Liu, X Xu, P Zhou, X Chen, D Ergu, H Yu, G Sun… - Neurocomputing, 2025 - Elsevier
With the increasing size of training datasets and models, the parameter synchronization stage
puts a heavy burden on the network, and communication has become one of the main …
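To make that burden concrete, below is a minimal, generic synchronous parameter-server round in NumPy. It is not PSscheduler's optical-network-aware algorithm; the helper name and the push/aggregate/pull structure are assumptions that only show how all traffic concentrates in the synchronization stage.

```python
# Minimal, generic parameter-server synchronization round (not PSscheduler):
# each worker pushes a gradient, the server averages them and applies one SGD
# step, and every worker then pulls the new parameters.
import numpy as np

def parameter_server_round(params: np.ndarray, worker_grads: list[np.ndarray],
                           lr: float = 0.1) -> np.ndarray:
    """One synchronous round: aggregate all workers' gradients, apply one SGD step."""
    avg_grad = np.mean(worker_grads, axis=0)  # "push" + aggregation at the server
    return params - lr * avg_grad             # new parameters, "pulled" by all workers

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    params = np.zeros(1_000, dtype=np.float32)
    grads = [rng.standard_normal(1_000).astype(np.float32) for _ in range(8)]
    params = parameter_server_round(params, grads)
    traffic_bytes = sum(g.nbytes for g in grads) + len(grads) * params.nbytes
    print("bytes on the network this round:", traffic_bytes)
```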

Understanding the Throughput Bounds of Reconfigurable Datacenter Networks

V Addanki, C Avin, S Schmid - arXiv preprint arXiv:2405.20869, 2024 - arxiv.org
The increasing gap between the growth of datacenter traffic volume and the capacity of
electrical switches led to the emergence of reconfigurable datacenter network designs …

Communication optimization for distributed training: architecture, advances, and opportunities

Y Wei, T Hu, C Liang, Y Cui - IEEE Network, 2024 - ieeexplore.ieee.org
The past few years have witnessed the flourishing of large-scale deep neural network
models with ever-growing parameter numbers. Training such large-scale models typically …

Straggler-Aware Gradient Aggregation for Large-Scale Distributed Deep Learning System

Y Li, J Huang, Z Li, J Liu, S Zhou… - IEEE/ACM …, 2024 - ieeexplore.ieee.org
Deep Neural Networks (DNNs) are a critical component of a wide range of applications.
However, with the rapid growth of the training dataset and model size, communication …
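As a rough illustration of one common straggler-mitigation idea (not necessarily the aggregation scheme this paper proposes), the sketch below averages gradients from the first k of n workers to report and ignores the slowest, trading a little gradient quality for lower step latency. All names are hypothetical.

```python
# Generic straggler-mitigation sketch: aggregate gradients from the k
# earliest-arriving workers and skip the stragglers ("backup worker" style).
import heapq
import numpy as np

def aggregate_first_k(arrival_times: list[float], worker_grads: list[np.ndarray],
                      k: int) -> np.ndarray:
    """Average the gradients of the k earliest-arriving workers."""
    earliest = heapq.nsmallest(k, range(len(arrival_times)),
                               key=lambda i: arrival_times[i])
    return np.mean([worker_grads[i] for i in earliest], axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, k = 8, 6
    grads = [rng.standard_normal(4).astype(np.float32) for _ in range(n)]
    arrivals = list(rng.exponential(scale=1.0, size=n))  # a couple of slow workers
    print("aggregated gradient:", aggregate_first_k(arrivals, grads, k))
```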