Scalable deep learning on distributed infrastructures: Challenges, techniques, and tools

R Mayer, HA Jacobsen - ACM Computing Surveys (CSUR), 2020 - dl.acm.org
Deep Learning (DL) has had an immense success in the recent past, leading to state-of-the-
art results in various domains, such as image recognition and natural language processing …

Deep learning workload scheduling in GPU datacenters: A survey

Z Ye, W Gao, Q Hu, P Sun, X Wang, Y Luo… - ACM Computing …, 2024 - dl.acm.org
Deep learning (DL) has demonstrated its remarkable success in a wide variety of fields. The
development of a DL model is a time-consuming and resource-intensive procedure. Hence …

Population based training of neural networks

M Jaderberg, V Dalibard, S Osindero… - arXiv preprint arXiv …, 2017 - arxiv.org
Neural networks dominate the modern machine learning landscape, but their training and
success still suffer from sensitivity to empirical choices of hyperparameters such as model …

Gandiva: Introspective cluster scheduling for deep learning

W Xiao, R Bhardwaj, R Ramjee, M Sivathanu… - … USENIX Symposium on …, 2018 - usenix.org
We introduce Gandiva, a new cluster scheduling framework that utilizes domain-specific
knowledge to improve latency and efficiency of training deep learning models in a GPU …

Themis: Fair and efficient GPU cluster scheduling

K Mahajan, A Balasubramanian, A Singhvi… - … USENIX Symposium on …, 2020 - usenix.org
Modern distributed machine learning (ML) training workloads benefit significantly from
leveraging GPUs. However, significant contention ensues when multiple such workloads are …

Ekya: Continuous learning of video analytics models on edge compute servers

R Bhardwaj, Z Xia, G Ananthanarayanan… - … USENIX Symposium on …, 2022 - usenix.org
Video analytics applications use edge compute servers for processing videos. Compressed
models that are deployed on the edge servers for inference suffer from data drift where the …

An empirical study on program failures of deep learning jobs

R Zhang, W Xiao, H Zhang, Y Liu, H Lin… - Proceedings of the ACM …, 2020 - dl.acm.org
Deep learning has made significant achievements in many application areas. To train and
test models more efficiently, enterprise developers submit and run their deep learning …

Learning intrinsic sparse structures within long short-term memory

W Wen, Y He, S Rajbhandari, M Zhang… - arXiv preprint arXiv …, 2017 - arxiv.org
Model compression is significant for the wide adoption of Recurrent Neural Networks
(RNNs) in both user devices possessing limited resources and business clusters requiring …

Eight years of AutoML: categorisation, review and trends

R Barbudo, S Ventura, JR Romero - Knowledge and Information Systems, 2023 - Springer
Knowledge extraction through machine learning techniques has been successfully
applied in a large number of application domains. However, apart from the required …

Quiver: An informed storage cache for deep learning

AV Kumar, M Sivathanu - 18th USENIX Conference on File and Storage …, 2020 - usenix.org
We introduce Quiver, an informed storage cache for deep learning training (DLT) jobs in a
cluster of GPUs. Quiver employs domain-specific intelligence within the caching layer, to …