Deep neural networks in the cloud: Review, applications, challenges and research directions

KY Chan, B Abu-Salih, R Qaddoura, AZ Ala'M… - Neurocomputing, 2023 - Elsevier
Deep neural networks (DNNs) are currently being deployed as machine learning technology
in a wide range of important real-world applications. DNNs consist of a huge number of …

Scalable deep learning on distributed infrastructures: Challenges, techniques, and tools

R Mayer, HA Jacobsen - ACM Computing Surveys (CSUR), 2020 - dl.acm.org
Deep Learning (DL) has had immense success in the recent past, leading to state-of-the-
art results in various domains, such as image recognition and natural language processing …

Pollux: Co-adaptive cluster scheduling for goodput-optimized deep learning

A Qiao, SK Choe, SJ Subramanya… - … on Operating Systems …, 2021 - usenix.org
Pollux improves scheduling performance in deep learning (DL) clusters by adaptively co-
optimizing inter-dependent factors both at the per-job level and at the cluster-wide level …

Bamboo: Making preemptible instances resilient for affordable training of large DNNs

J Thorpe, P Zhao, J Eyolfson, Y Qiao, Z Jia… - … USENIX Symposium on …, 2023 - usenix.org
DNN models across many domains continue to grow in size, resulting in high resource
requirements for effective training, and unpalatable (and often unaffordable) costs for …

DL2: A deep learning-driven scheduler for deep learning clusters

Y Peng, Y Bao, Y Chen, C Wu… - IEEE Transactions on …, 2021 - ieeexplore.ieee.org
Efficient resource scheduling is essential for maximal utilization of expensive deep learning
(DL) clusters. Existing cluster schedulers either are agnostic to machine learning (ML) …

Heet: Accelerating elastic training in heterogeneous deep learning clusters

Z Mo, H Xu, C Xu - Proceedings of the 29th ACM International …, 2024 - dl.acm.org
Modern GPU clusters inherently exhibit heterogeneity, encompassing various aspects such
as computation and communication. This heterogeneity poses a significant challenge for the …

Crossbow: Scaling deep learning with small batch sizes on multi-GPU servers

A Koliousis, P Watcharapichat, M Weidlich… - arXiv preprint arXiv …, 2019 - arxiv.org
Deep learning models are trained on servers with many GPUs, and training must scale with
the number of GPUs. Systems such as TensorFlow and Caffe2 train models with parallel …