MLaaS in the wild: Workload analysis and scheduling in large-scale heterogeneous GPU clusters

Q Weng, W Xiao, Y Yu, W Wang, C Wang, J He… - … USENIX Symposium on …, 2022 - usenix.org
With sustained technological advances in machine learning (ML) and the recent availability of
massive datasets, tech companies are deploying large ML-as-a-Service (MLaaS) …

Deep learning workload scheduling in GPU datacenters: A survey

Z Ye, W Gao, Q Hu, P Sun, X Wang, Y Luo… - ACM Computing …, 2024 - dl.acm.org
Deep learning (DL) has demonstrated remarkable success in a wide variety of fields. The
development of a DL model is a time-consuming and resource-intensive procedure. Hence …

A survey on scheduling techniques in computing and network convergence

S Tang, Y Yu, H Wang, G Wang, W Chen… - … Surveys & Tutorials, 2023 - ieeexplore.ieee.org
The computing demand for massive applications has led to the ubiquitous deployment of
computing power. This trend results in the urgent need for higher-level computing resource …

Gandiva: Introspective cluster scheduling for deep learning

W Xiao, R Bhardwaj, R Ramjee, M Sivathanu… - … USENIX Symposium on …, 2018 - usenix.org
We introduce Gandiva, a new cluster scheduling framework that utilizes domain-specific
knowledge to improve latency and efficiency of training deep learning models in a GPU …

Tiresias: A {GPU} cluster manager for distributed deep learning

J Gu, M Chowdhury, KG Shin, Y Zhu, M Jeon… - … USENIX Symposium on …, 2019 - usenix.org
Deep learning (DL) training jobs bring some unique challenges to existing cluster
managers, such as unpredictable training times, an all-or-nothing execution model, and …

MAST: Global scheduling of ML training across Geo-Distributed datacenters at hyperscale

A Choudhury, Y Wang, T Pelkonen… - 18th USENIX …, 2024 - yangwang83.github.io
In public clouds, users must manually select a datacenter region to upload their ML training
data and launch ML training workloads in the same region to ensure data and computation …

HiveD: Sharing a GPU cluster for deep learning with guarantees

H Zhao, Z Han, Z Yang, Q Zhang, F Yang… - … USENIX symposium on …, 2020 - usenix.org
Deep learning training on a shared GPU cluster is becoming a common practice. However,
we observe severe sharing anomalies in production multi-tenant clusters where jobs in some …

Deep learning workload scheduling in GPU datacenters: Taxonomy, challenges and vision

W Gao, Q Hu, Z Ye, P Sun, X Wang, Y Luo… - arXiv preprint arXiv …, 2022 - arxiv.org
Deep learning (DL) has prospered in a wide variety of fields. The development of a DL
model is a time-consuming and resource-intensive procedure. Hence, dedicated GPU …

Comet: a novel memory-efficient deep learning training framework by using error-bounded lossy compression

S Jin, C Zhang, X Jiang, Y Feng, H Guan, G Li… - arXiv preprint arXiv …, 2021 - arxiv.org
Training wide and deep neural networks (DNNs) requires large amounts of storage resources
such as memory, because the intermediate activation data must be saved in memory …

Kube-knots: Resource harvesting through dynamic container orchestration in GPU-based datacenters

P Thinakaran, JR Gunasekaran… - … on cluster computing …, 2019 - ieeexplore.ieee.org
Compute heterogeneity is increasingly gaining prominence in modern datacenters due to
the addition of accelerators like GPUs and FPGAs. We observe that datacenter schedulers …