Deep learning workload scheduling in GPU datacenters: A survey
Deep learning (DL) has demonstrated remarkable success in a wide variety of fields. The
development of a DL model is a time-consuming and resource-intensive procedure. Hence …
Deep learning workload scheduling in GPU datacenters: Taxonomy, challenges and vision
Deep learning (DL) has prospered in a wide variety of fields. The development of a DL
model is a time-consuming and resource-intensive procedure. Hence, dedicated GPU …
Sia: Heterogeneity-aware, goodput-optimized ML-cluster scheduling
S Jayaram Subramanya, D Arfeen, S Lin… - Proceedings of the 29th …, 2023 - dl.acm.org
The Sia scheduler efficiently assigns heterogeneous deep learning (DL) cluster resources to
elastic resource-adaptive jobs. Although some recent schedulers address one aspect or …
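To illustrate the core idea of goodput-driven allocation, here is a minimal sketch, assuming a profiled per-job goodput table; the greedy loop, job names, and all numbers are hypothetical, and this is not Sia's actual algorithm (which co-optimizes GPU type, count, and job configuration):

```python
# A toy goodput-driven allocator (NOT Sia's algorithm): greedily give each
# job the free GPU type with the highest profiled goodput. The goodput
# table (throughput x statistical efficiency) is a hypothetical profile.
from itertools import product

GOODPUT = {  # goodput of one GPU of each type, per job -- made-up numbers
    ("jobA", "A100"): 100.0, ("jobA", "V100"): 55.0,
    ("jobB", "A100"): 80.0,  ("jobB", "V100"): 60.0,
}
free = {"A100": 1, "V100": 8}            # free GPUs per type
alloc = {"jobA": None, "jobB": None}

# visit (job, GPU type) pairs in decreasing goodput order
for job, gtype in sorted(product(alloc, free), key=lambda p: -GOODPUT[p]):
    if alloc[job] is None and free[gtype] > 0:
        alloc[job] = gtype
        free[gtype] -= 1

print(alloc)  # {'jobA': 'A100', 'jobB': 'V100'}
```

With only one A100 free, jobB falls back to a V100, where its goodput penalty is smaller than jobA's; a heterogeneity-aware scheduler exploits exactly this kind of asymmetry.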
Power-aware Deep Learning Model Serving with μ-Serve
With the increasing popularity of large deep learning model-serving workloads, there is a
pressing need to reduce the energy consumption of a model-serving cluster while …
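The underlying power/latency trade-off can be sketched as picking the lowest-power GPU frequency that still meets a latency SLO; the profile table and all numbers below are hypothetical placeholders, not μ-Serve's actual mechanism:

```python
# Hedged sketch of power-aware serving: choose the lowest-power GPU SM
# frequency whose profiled p99 latency still meets the SLO. Latency/power
# pairs would come from offline profiling; these values are invented.
SLO_MS = 50.0
PROFILE = {  # frequency (MHz) -> (p99 latency ms, power W) -- made up
    1410: (28.0, 300), 1200: (34.0, 240), 990: (45.0, 190), 780: (62.0, 150),
}

def pick_frequency(profile, slo_ms):
    """Lowest-power frequency whose profiled latency meets the SLO."""
    feasible = [(watts, f) for f, (lat, watts) in profile.items() if lat <= slo_ms]
    if not feasible:
        raise ValueError("no frequency meets the SLO; scale out instead")
    return min(feasible)[1]

print(pick_frequency(PROFILE, SLO_MS))  # 990: meets 50 ms at 190 W, not 300 W
```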
USHER: Holistic Interference Avoidance for Resource Optimized ML Inference
Minimizing monetary cost and maximizing the goodput of inference serving systems are
increasingly important with the ever-increasing popularity of deep learning models. While it …
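As a rough illustration of interference-avoiding placement, the first-fit-decreasing sketch below colocates model instances only while profiled compute and memory demands stay under headroom caps; all demand figures are invented, and USHER's real interference estimator is considerably more sophisticated:

```python
# Toy interference-aware packing: colocate models on a GPU only if their
# combined SM utilization and memory fit under caps that leave headroom.
models = {  # (SM utilization fraction, GPU memory GB) -- made-up profiles
    "bert": (0.45, 8), "resnet": (0.30, 4), "whisper": (0.50, 10),
}
GPU_MEM_GB, SM_CAP = 24, 0.9   # headroom cap to limit interference

gpus = []  # each GPU: [used_sm, used_mem, [model names]]
for name, (sm, mem) in sorted(models.items(), key=lambda kv: -kv[1][0]):
    for gpu in gpus:  # first fit, largest models placed first
        if gpu[0] + sm <= SM_CAP and gpu[1] + mem <= GPU_MEM_GB:
            gpu[0] += sm; gpu[1] += mem; gpu[2].append(name)
            break
    else:
        gpus.append([sm, mem, [name]])

for i, (sm, mem, names) in enumerate(gpus):
    print(f"GPU{i}: {names} (SM {sm:.2f}, mem {mem} GB)")
```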
Beware of Fragmentation: Scheduling GPU-Sharing Workloads with Fragmentation Gradient Descent
Large tech companies are piling up a massive number of GPUs in their server fleets to run
diverse machine learning (ML) workloads. However, these expensive devices often suffer …
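The idea can be sketched as scoring each candidate node by how much additional "unusable" GPU capacity a placement would create, then descending that fragmentation gradient; the measure and numbers below are simplified stand-ins for the paper's formal definition:

```python
# Simplified fragmentation-aware placement in the spirit of FGD: place a
# fractional-GPU request on the node whose fragmentation grows the least.
def fragmentation(free_gpus, typical_request=0.5):
    """Sum of per-GPU leftovers too small to serve a typical request."""
    return sum(f for f in free_gpus if 0 < f < typical_request)

def place(nodes, demand):
    """Pick (delta, node, gpu_index) minimizing the fragmentation increase
    after placing `demand` (a fractional GPU share) on a best-fitting GPU."""
    best = None
    for name, free_gpus in nodes.items():
        fits = [i for i, f in enumerate(free_gpus) if f >= demand]
        if not fits:
            continue
        i = min(fits, key=lambda i: free_gpus[i])           # tightest fit
        after = free_gpus[:]; after[i] -= demand
        delta = fragmentation(after) - fragmentation(free_gpus)
        if best is None or delta < best[0]:
            best = (delta, name, i)
    return best

nodes = {"n1": [1.0, 0.6], "n2": [0.7, 0.7]}   # free fraction per GPU
print(place(nodes, 0.3))  # picks n1's 0.6 GPU: fewer unusable crumbs left
```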
MISO: exploiting multi-instance GPU capability on multi-tenant GPU clusters
GPU technology has been improving at an expedited pace in terms of size and performance,
empowering HPC and AI/ML researchers to advance the scientific discovery process …
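A toy version of MIG partition selection: enumerate a few valid slice layouts and pick the one maximizing predicted aggregate throughput. The partition list and speedup table are illustrative only; MISO itself predicts such numbers cheaply via MPS probing rather than exhaustive profiling:

```python
# Brute-force MIG partition choice over a tiny, illustrative lookup table.
from itertools import permutations

PARTITIONS = [("7g",), ("4g", "3g"), ("3g", "3g", "1g"),
              ("2g", "2g", "2g", "1g")]       # not the full A100 MIG list
SPEED = {  # normalized throughput of each job on each slice -- made up
    ("jobA", "7g"): 1.0, ("jobA", "4g"): 0.8, ("jobA", "3g"): 0.65,
    ("jobA", "2g"): 0.5, ("jobA", "1g"): 0.3,
    ("jobB", "7g"): 1.0, ("jobB", "4g"): 0.95, ("jobB", "3g"): 0.9,
    ("jobB", "2g"): 0.7, ("jobB", "1g"): 0.4,
}
jobs = ["jobA", "jobB"]

best = max(
    ((sum(SPEED[(j, s)] for j, s in zip(perm, part)), part, perm)
     for part in PARTITIONS if len(part) >= len(jobs)
     for perm in permutations(jobs)),
    key=lambda t: t[0],
)
print(best)  # jobA on the 4g slice, jobB on the 3g slice (score 1.70)
```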
Toward sustainable HPC: Carbon footprint estimation and environmental implications of HPC systems
The rapid growth in demand for HPC systems has led to a rise in their carbon footprint, which
requires urgent intervention. In this work, we present a comprehensive analysis of the …
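The standard accounting such analyses build on splits a job's footprint into operational carbon (energy times grid carbon intensity) plus an amortized share of embodied carbon; the function below is a back-of-the-envelope sketch with placeholder constants, not the paper's methodology:

```python
# Back-of-the-envelope carbon model: operational + amortized embodied carbon.
def job_carbon_kg(energy_kwh, grid_kgco2_per_kwh,
                  embodied_kgco2, lifetime_hours, job_hours):
    operational = energy_kwh * grid_kgco2_per_kwh          # energy x intensity
    embodied = embodied_kgco2 * (job_hours / lifetime_hours)  # amortized share
    return operational + embodied

# e.g. a 24 h job drawing 300 W on average consumes 7.2 kWh; all other
# constants below (grid intensity, embodied carbon, lifetime) are invented
print(job_carbon_kg(energy_kwh=7.2, grid_kgco2_per_kwh=0.4,
                    embodied_kgco2=150.0, lifetime_hours=5 * 365 * 24,
                    job_hours=24))   # ~2.96 kg CO2e
```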
Efficient training of large language models on distributed infrastructures: a survey
Large Language Models (LLMs) like GPT and LLaMA are revolutionizing the AI industry with
their sophisticated capabilities. Training these models requires vast GPU clusters and …
Chronus: A novel deadline-aware scheduler for deep learning training jobs
Modern GPU clusters support Deep Learning training (DLT) jobs in a distributed manner.
Job scheduling is the key to improving training performance, resource utilization, and …
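Deadline awareness ultimately rests on a feasibility test; here is a minimal sketch, assuming a single-queue pool and earliest-deadline-first ordering (Chronus itself formulates a richer lease-based optimization, so the times and the test are illustrative):

```python
# EDF-style admission check: a job set is feasible iff, processed in
# earliest-deadline-first order, every job finishes by its deadline.
def edf_feasible(jobs):
    """jobs: (remaining_gpu_hours, deadline_hours_from_now) pairs, under the
    simplifying assumption that jobs run one at a time on a fixed pool."""
    t = 0.0
    for work, deadline in sorted(jobs, key=lambda j: j[1]):  # EDF order
        t += work
        if t > deadline:
            return False
    return True

admitted = [(4.0, 6.0), (2.0, 10.0)]
candidate = (3.0, 6.5)
print(edf_feasible(admitted + [candidate]))  # False -> reject the candidate
```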