Scalable deep learning on distributed infrastructures: Challenges, techniques, and tools
Deep Learning (DL) has had an immense success in the recent past, leading to state-of-the-
art results in various domains, such as image recognition and natural language processing …
A generic communication scheduler for distributed DNN training acceleration
We present ByteScheduler, a generic communication scheduler for distributed DNN training
acceleration. ByteScheduler is based on our principled analysis that partitioning and …
FireCaffe: near-linear acceleration of deep neural network training on compute clusters
Long training times for high-accuracy deep neural networks (DNNs) impede research into
new DNN architectures and slow the development of high-accuracy DNNs. In this paper we …
SiP-ML: high-bandwidth optical network interconnects for machine learning training
This paper proposes optical network interconnects as a key enabler for building high-
bandwidth ML training clusters with strong scaling properties. Our design, called SiP-ML …
CASSINI: Network-aware job scheduling in machine learning clusters
We present CASSINI, a network-aware job scheduler for machine learning (ML) clusters.
CASSINI introduces a novel geometric abstraction to consider the communication pattern of …
KungFu: Making training in distributed machine learning adaptive
When using distributed machine learning (ML) systems to train models on a cluster of worker
machines, users must configure a large number of parameters: hyper-parameters (e.g., the …
Ekko: A large-scale deep learning recommender system with low-latency model update
Deep Learning Recommender Systems (DLRSs) need to update models at low latency, thus
promptly serving new users and content. Existing DLRSs, however, fail to do so. They …
Preemptive all-reduce scheduling for expediting distributed DNN training
Data-parallel training is widely used for scaling DNN training over large datasets, using the
parameter server or all-reduce architecture. Communication scheduling has been promising …
SmartPC: Hierarchical pace control in real-time federated learning system
Federated Learning is a technique for learning AI models through the collaboration of a
large number of resource-constrained mobile devices, while preserving data privacy. Instead …
Communication optimization strategies for distributed deep neural network training: A survey
Recent trends in high-performance computing and deep learning have led to the
proliferation of studies on large-scale deep neural network training. However, the frequent …