Scalable deep learning on distributed infrastructures: Challenges, techniques, and tools
Deep Learning (DL) has had an immense success in the recent past, leading to state-of-the-
art results in various domains, such as image recognition and natural language processing …
art results in various domains, such as image recognition and natural language processing …
Deep learning workload scheduling in gpu datacenters: A survey
Deep learning (DL) has demonstrated its remarkable success in a wide variety of fields. The
development of a DL model is a time-consuming and resource-intensive procedure. Hence …
development of a DL model is a time-consuming and resource-intensive procedure. Hence …
Population based training of neural networks
Neural networks dominate the modern machine learning landscape, but their training and
success still suffer from sensitivity to empirical choices of hyperparameters such as model …
success still suffer from sensitivity to empirical choices of hyperparameters such as model …
Gandiva: Introspective cluster scheduling for deep learning
We introduce Gandiva, a new cluster scheduling framework that utilizes domain-specific
knowledge to improve latency and efficiency of training deep learning models in a GPU …
knowledge to improve latency and efficiency of training deep learning models in a GPU …
Themis: Fair and efficient {GPU} cluster scheduling
Modern distributed machine learning (ML) training workloads benefit significantly from
leveraging GPUs. However, significant contention ensues when multiple such workloads are …
leveraging GPUs. However, significant contention ensues when multiple such workloads are …
Ekya: Continuous learning of video analytics models on edge compute servers
Video analytics applications use edge compute servers for processing videos. Compressed
models that are deployed on the edge servers for inference suffer from data drift where the …
models that are deployed on the edge servers for inference suffer from data drift where the …
An empirical study on program failures of deep learning jobs
Deep learning has made significant achievements in many application areas. To train and
test models more efficiently, enterprise developers submit and run their deep learning …
test models more efficiently, enterprise developers submit and run their deep learning …
Learning intrinsic sparse structures within long short-term memory
Model compression is significant for the wide adoption of Recurrent Neural Networks
(RNNs) in both user devices possessing limited resources and business clusters requiring …
(RNNs) in both user devices possessing limited resources and business clusters requiring …
Eight years of AutoML: categorisation, review and trends
Abstract Knowledge extraction through machine learning techniques has been successfully
applied in a large number of application domains. However, apart from the required …
applied in a large number of application domains. However, apart from the required …
Quiver: An informed storage cache for deep learning
We introduce Quiver, an informed storage cache for deep learning training (DLT) jobs in a
cluster of GPUs. Quiver employs domain-specific intelligence within the caching layer, to …
cluster of GPUs. Quiver employs domain-specific intelligence within the caching layer, to …