Deep learning workload scheduling in gpu datacenters: A survey

Z Ye, W Gao, Q Hu, P Sun, X Wang, Y Luo… - ACM Computing …, 2024 - dl.acm.org
Deep learning (DL) has demonstrated its remarkable success in a wide variety of fields. The
development of a DL model is a time-consuming and resource-intensive procedure. Hence …

A survey on scheduling techniques in computing and network convergence

S Tang, Y Yu, H Wang, G Wang, W Chen… - … Surveys & Tutorials, 2023 - ieeexplore.ieee.org
The computing demand for massive applications has led to the ubiquitous deployment of
computing power. This trend results in the urgent need for higher-level computing resource …

{MLaaS} in the wild: Workload analysis and scheduling in {Large-Scale} heterogeneous {GPU} clusters

Q Weng, W **ao, Y Yu, W Wang, C Wang, J He… - … USENIX Symposium on …, 2022 - usenix.org
With the sustained technological advances in machine learning (ML) and the availability of
massive datasets recently, tech companies are deploying large ML-as-a-Service (MLaaS) …

Oort: Efficient federated learning via guided participant selection

F Lai, X Zhu, HV Madhyastha… - 15th {USENIX} Symposium …, 2021 - usenix.org
Federated Learning (FL) is an emerging direction in distributed machine learning (ML) that
enables in-situ model training and testing on edge data. Despite having the same end goals …

Fairness in serving large language models

Y Sheng, S Cao, D Li, B Zhu, Z Li, D Zhuo… - … USENIX Symposium on …, 2024 - usenix.org
High-demand LLM inference services (eg, ChatGPT and BARD) support a wide range of
requests from short chat conversations to long document reading. To ensure that all client …

{INFaaS}: Automated model-less inference serving

F Romero, Q Li, NJ Yadwadkar… - 2021 USENIX Annual …, 2021 - usenix.org
Despite existing work in machine learning inference serving, ease-of-use and cost efficiency
remain challenges at large scales. Developers must manually search through thousands of …

Characterization of large language model development in the datacenter

Q Hu, Z Ye, Z Wang, G Wang, M Zhang… - … USENIX Symposium on …, 2024 - usenix.org
Large Language Models (LLMs) have presented impressive performance across several
transformative tasks. However, it is non-trivial to efficiently utilize large-scale cluster …

{Heterogeneity-Aware} cluster scheduling policies for deep learning workloads

D Narayanan, K Santhanam, F Kazhamiaka… - … USENIX Symposium on …, 2020 - usenix.org
Specialized accelerators such as GPUs, TPUs, FPGAs, and custom ASICs have been
increasingly deployed to train deep learning models. These accelerators exhibit …

Characterization and prediction of deep learning workloads in large-scale gpu datacenters

Q Hu, P Sun, S Yan, Y Wen, T Zhang - Proceedings of the International …, 2021 - dl.acm.org
Modern GPU datacenters are critical for delivering Deep Learning (DL) models and services
in both the research community and industry. When operating a datacenter, optimization of …

{MAST}: Global scheduling of {ML} training across {Geo-Distributed} datacenters at hyperscale

A Choudhury, Y Wang, T Pelkonen… - … USENIX Symposium on …, 2024 - usenix.org
In public clouds, users must manually select a datacenter region to upload their ML training
data and launch ML training workloads in the same region to ensure data and computation …