Deep learning workload scheduling in GPU datacenters: A survey
Deep learning (DL) has demonstrated remarkable success in a wide variety of fields. The
development of a DL model is a time-consuming and resource-intensive procedure. Hence …
MLaaS in the wild: Workload analysis and scheduling in Large-Scale heterogeneous GPU clusters
With sustained technological advances in machine learning (ML) and the recent availability of
massive datasets, tech companies are deploying large ML-as-a-Service (MLaaS) …
Deep learning workload scheduling in GPU datacenters: Taxonomy, challenges and vision
Deep learning (DL) has flourished in a wide variety of fields. The development of a DL
model is a time-consuming and resource-intensive procedure. Hence, dedicated GPU …
Pollux: Co-adaptive cluster scheduling for goodput-optimized deep learning
Pollux improves scheduling performance in deep learning (DL) clusters by adaptively co-
optimizing inter-dependent factors both at the per-job level and at the cluster-wide level …
Characterization and prediction of deep learning workloads in large-scale GPU datacenters
Modern GPU datacenters are critical for delivering Deep Learning (DL) models and services
in both the research community and industry. When operating a datacenter, optimization of …
Looking beyond GPUs for DNN scheduling on Multi-Tenant clusters
Training Deep Neural Networks (DNNs) is a popular workload in both enterprises and cloud
data centers. Existing schedulers for DNN training consider GPU as the dominant resource …
Sia: Heterogeneity-aware, goodput-optimized ML-cluster scheduling
The Sia scheduler efficiently assigns heterogeneous deep learning (DL) cluster resources to
elastic resource-adaptive jobs. Although some recent schedulers address one aspect or …
MAST: Global scheduling of ML training across Geo-Distributed datacenters at hyperscale
A Choudhury, Y Wang, T Pelkonen… - 18th USENIX …, 2024 - yangwang83.github.io
In public clouds, users must manually select a datacenter region to upload their ML training
data and launch ML training workloads in the same region to ensure data and computation …
Liquid: Intelligent resource estimation and network-efficient scheduling for deep learning jobs on distributed GPU clusters
Deep learning (DL) is becoming increasingly popular in many domains, including computer
vision, speech recognition, self-driving automobiles, etc. GPUs can train DL models efficiently …
Multi-resource interleaving for deep learning training
Training a Deep Learning (DL) model requires multiple resource types, including CPUs,
GPUs, storage IO, and network IO. Advancements in DL have produced a wide spectrum of …