Deep learning workload scheduling in GPU datacenters: A survey

Z Ye, W Gao, Q Hu, P Sun, X Wang, Y Luo… - ACM Computing …, 2024 - dl.acm.org
Deep learning (DL) has demonstrated its remarkable success in a wide variety of fields. The
development of a DL model is a time-consuming and resource-intensive procedure. Hence …

A survey on scheduling techniques in computing and network convergence

S Tang, Y Yu, H Wang, G Wang, W Chen… - … Surveys & Tutorials, 2023 - ieeexplore.ieee.org
The computing demand for massive applications has led to the ubiquitous deployment of
computing power. This trend results in the urgent need for higher-level computing resource …

Learning scheduling algorithms for data processing clusters

H Mao, M Schwarzkopf, SB Venkatakrishnan… - Proceedings of the …, 2019 - dl.acm.org
Efficiently scheduling data processing jobs on distributed compute clusters requires complex
algorithms. Current systems use simple, generalized heuristics and ignore workload …

Serverless computing: One step forward, two steps back

JM Hellerstein, J Faleiro, JE Gonzalez… - arXiv preprint arXiv …, 2018 - arxiv.org
Serverless computing offers the potential to program the cloud in an autoscaling,
pay-as-you-go manner. In this paper we address critical gaps in first-generation serverless computing …

Heterogeneity-Aware cluster scheduling policies for deep learning workloads

D Narayanan, K Santhanam, F Kazhamiaka… - … USENIX Symposium on …, 2020 - usenix.org
Specialized accelerators such as GPUs, TPUs, FPGAs, and custom ASICs have been
increasingly deployed to train deep learning models. These accelerators exhibit …

Tiresias: A GPU cluster manager for distributed deep learning

J Gu, M Chowdhury, KG Shin, Y Zhu, M Jeon… - … USENIX Symposium on …, 2019 - usenix.org
Deep learning (DL) training jobs bring some unique challenges to existing cluster
managers, such as unpredictable training times, an all-or-nothing execution model, and …

Optimus: an efficient dynamic resource scheduler for deep learning clusters

Y Peng, Y Bao, Y Chen, C Wu, C Guo - Proceedings of the Thirteenth …, 2018 - dl.acm.org
Deep learning workloads are common in today's production clusters due to the proliferation
of deep learning driven AI services (e.g., speech recognition, machine translation). A deep …

Icebreaker: Warming serverless functions better with heterogeneity

RB Roy, T Patel, D Tiwari - Proceedings of the 27th ACM International …, 2022 - dl.acm.org
Serverless computing, an emerging computing model, relies on "warming up" functions prior
to its anticipated execution for faster and cost-effective service to users. Unfortunately …

Faster and cheaper serverless computing on harvested resources

Y Zhang, Í Goiri, GI Chaudhry, R Fonseca… - Proceedings of the …, 2021 - dl.acm.org
Serverless computing is becoming increasingly popular due to its ease of programming, fast
elasticity, and fine-grained billing. However, the serverless provider still needs to provision …

ByteGNN: efficient graph neural network training at large scale

C Zheng, H Chen, Y Cheng, Z Song, Y Wu… - Proceedings of the …, 2022 - dl.acm.org
Graph neural networks (GNNs) have shown excellent performance in a wide range of
applications such as recommendation, risk control, and drug discovery. With the increase in …