Scalable deep learning on distributed infrastructures: Challenges, techniques, and tools

R Mayer, HA Jacobsen - ACM Computing Surveys (CSUR), 2020 - dl.acm.org
Deep Learning (DL) has had immense success in the recent past, leading to state-of-the-
art results in various domains, such as image recognition and natural language processing …

A generic communication scheduler for distributed DNN training acceleration

Y Peng, Y Zhu, Y Chen, Y Bao, B Yi, C Lan… - Proceedings of the 27th …, 2019 - dl.acm.org
We present ByteScheduler, a generic communication scheduler for distributed DNN training
acceleration. ByteScheduler is based on our principled analysis that partitioning and …
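The snippet above is truncated, but it points at the general idea of communication scheduling: partitioning gradient tensors into small chunks and sending chunks belonging to front layers first, so the next iteration's forward pass can start sooner. The sketch below illustrates that idea only; the chunk size, the priority rule, and the send_chunk transport stub are assumptions for illustration, not ByteScheduler's actual design or API.

```python
import heapq
import numpy as np

def partition(tensor, chunk_elems):
    # Split a flat gradient tensor into fixed-size chunks (chunk size is an assumed knob).
    flat = tensor.ravel()
    return [flat[i:i + chunk_elems] for i in range(0, flat.size, chunk_elems)]

def send_chunk(layer_idx, part_idx, chunk):
    # Stand-in for the real transport (e.g. a push to a parameter server or an all-reduce slice).
    print(f"sending layer {layer_idx} chunk {part_idx} ({chunk.size} elems)")

def schedule_push(grads_by_layer, chunk_elems=65536):
    # Front layers get higher priority: their updated weights are needed first by the
    # next iteration's forward pass, so their chunks are transmitted first.
    queue = []
    for layer_idx, grad in enumerate(grads_by_layer):
        for part_idx, chunk in enumerate(partition(grad, chunk_elems)):
            heapq.heappush(queue, (layer_idx, part_idx, chunk))  # smaller key pops first
    while queue:
        layer_idx, part_idx, chunk = heapq.heappop(queue)
        send_chunk(layer_idx, part_idx, chunk)

if __name__ == "__main__":
    grads = [np.random.randn(1 << 17).astype(np.float32) for _ in range(4)]
    schedule_push(grads)
```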

Firecaffe: near-linear acceleration of deep neural network training on compute clusters

FN Iandola, MW Moskewicz… - Proceedings of the …, 2016 - openaccess.thecvf.com
Long training times for high-accuracy deep neural networks (DNNs) impede research into
new DNN architectures and slow the development of high-accuracy DNNs. In this paper we …

SiP-ML: high-bandwidth optical network interconnects for machine learning training

M Khani, M Ghobadi, M Alizadeh, Z Zhu… - Proceedings of the …, 2021 - dl.acm.org
This paper proposes optical network interconnects as a key enabler for building high-
bandwidth ML training clusters with strong scaling properties. Our design, called SiP-ML …

CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters

S Rajasekaran, M Ghobadi, A Akella - 21st USENIX Symposium on …, 2024 - usenix.org
We present CASSINI, a network-aware job scheduler for machine learning (ML) clusters.
CASSINI introduces a novel geometric abstraction to consider the communication pattern of …

KungFu: Making training in distributed machine learning adaptive

L Mai, G Li, M Wagenländer, K Fertakis… - … USENIX Symposium on …, 2020 - usenix.org
When using distributed machine learning (ML) systems to train models on a cluster of worker
machines, users must configure a large number of parameters: hyper-parameters (e.g. the …

Ekko: A Large-Scale deep learning recommender system with Low-Latency model update

C Sima, Y Fu, MK Sit, L Guo, X Gong, F Lin… - … USENIX Symposium on …, 2022 - usenix.org
Deep Learning Recommender Systems (DLRSs) need to update models at low latency, thus
promptly serving new users and content. Existing DLRSs, however, fail to do so. They …

Preemptive all-reduce scheduling for expediting distributed DNN training

Y Bao, Y Peng, Y Chen, C Wu - IEEE INFOCOM 2020-IEEE …, 2020 - ieeexplore.ieee.org
Data-parallel training is widely used for scaling DNN training over large datasets, using the
parameter server or all-reduce architecture. Communication scheduling has been promising …
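As background for the data-parallel setup this snippet describes, the sketch below shows a bare-bones all-reduce architecture in PyTorch: each worker computes gradients on its own data shard, the gradients are summed across workers and averaged, and only then does the optimizer step run. This is a generic illustration of synchronous data parallelism, not the preemptive scheduling technique the paper proposes.

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def allreduce_gradients(model):
    # Sum gradients across all workers, then divide to obtain the average.
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size

def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    model = torch.nn.Linear(16, 1)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    for step in range(3):
        x, y = torch.randn(8, 16), torch.randn(8, 1)  # each rank trains on its own shard
        loss = torch.nn.functional.mse_loss(model(x), y)
        opt.zero_grad()
        loss.backward()
        allreduce_gradients(model)  # synchronize gradients before the optimizer step
        opt.step()
        if rank == 0:
            print(f"step {step} loss {loss.item():.4f}")

    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)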

SmartPC: Hierarchical pace control in real-time federated learning system

L Li, H Xiong, Z Guo, J Wang… - 2019 IEEE Real-Time …, 2019 - ieeexplore.ieee.org
Federated Learning is a technique for learning AI models through the collaboration of a
large number of resource-constrained mobile devices, while preserving data privacy. Instead …
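For readers unfamiliar with the federated setting the snippet introduces, the following is a minimal FedAvg-style sketch in Python/NumPy: each client runs local gradient steps on its private data, only model weights are sent to the server, and the server averages them weighted by local dataset size. The linear model, learning rate, and round count are illustrative assumptions; SmartPC's hierarchical pace control itself is not shown.

```python
import numpy as np

def local_update(weights, data, lr=0.1, epochs=1):
    # One client's local gradient descent on its private data (linear model, squared loss).
    w = weights.copy()
    X, y = data
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

def federated_round(global_w, clients):
    # Server aggregates client models weighted by local dataset size (FedAvg-style).
    # Raw data never leaves the clients; only model weights are exchanged.
    sizes = np.array([len(y) for _, y in clients], dtype=float)
    local_models = [local_update(global_w, c) for c in clients]
    return np.average(local_models, axis=0, weights=sizes)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    true_w = rng.normal(size=4)
    clients = []
    for _ in range(5):
        X = rng.normal(size=(50, 4))
        clients.append((X, X @ true_w + 0.1 * rng.normal(size=50)))
    w = np.zeros(4)
    for _ in range(20):
        w = federated_round(w, clients)
    print("recovered weights:", np.round(w, 2))
```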

Communication optimization strategies for distributed deep neural network training: A survey

S Ouyang, D Dong, Y Xu, L Xiao - Journal of Parallel and Distributed …, 2021 - Elsevier
Recent trends in high-performance computing and deep learning have led to the
proliferation of studies on large-scale deep neural network training. However, the frequent …