Scalable deep learning on distributed infrastructures: Challenges, techniques, and tools

R Mayer, HA Jacobsen - ACM Computing Surveys (CSUR), 2020 - dl.acm.org
Deep Learning (DL) has had an immense success in the recent past, leading to state-of-the-
art results in various domains, such as image recognition and natural language processing …

Machine learning methods for reliable resource provisioning in edge-cloud computing: A survey

TL Duc, RG Leiva, P Casari, PO Östberg - ACM Computing Surveys …, 2019 - dl.acm.org
Large-scale software systems are currently designed as distributed entities and deployed in
cloud data centers. To overcome the limitations inherent to this type of deployment …

Pond: CXL-based memory pooling systems for cloud platforms

H Li, DS Berger, L Hsu, D Ernst, P Zardoshti… - Proceedings of the 28th …, 2023 - dl.acm.org
Public cloud providers seek to meet stringent performance requirements and low hardware
cost. A key driver of performance and cost is main memory. Memory pooling promises to …

Learning scheduling algorithms for data processing clusters

H Mao, M Schwarzkopf, SB Venkatakrishnan… - Proceedings of the …, 2019 - dl.acm.org
Efficiently scheduling data processing jobs on distributed compute clusters requires complex
algorithms. Current systems use simple, generalized heuristics and ignore workload …

Autopilot: workload autoscaling at Google

K Rzadca, P Findeisen, J Swiderski, P Zych… - Proceedings of the …, 2020 - dl.acm.org
In many public and private Cloud systems, users need to specify a limit for the amount of
resources (CPU cores and RAM) to provision for their workloads. A job that exceeds its limits …

Resource management with deep reinforcement learning

H Mao, M Alizadeh, I Menache, S Kandula - Proceedings of the 15th …, 2016 - dl.acm.org
Resource management problems in systems and networking often manifest as difficult
online decision making tasks where appropriate solutions depend on understanding the …

Optimus: an efficient dynamic resource scheduler for deep learning clusters

Y Peng, Y Bao, Y Chen, C Wu, C Guo - Proceedings of the Thirteenth …, 2018 - dl.acm.org
Deep learning workloads are common in today's production clusters due to the proliferation
of deep learning driven AI services (e.g., speech recognition, machine translation). A deep …

Learning to perform local rewriting for combinatorial optimization

X Chen, Y Tian - Advances in neural information …, 2019 - proceedings.neurips.cc
Search-based methods for hard combinatorial optimization are often guided by heuristics.
Tuning heuristics in various conditions and situations is often time-consuming. In this paper …

Serving heterogeneous machine learning models on Multi-GPU servers with Spatio-Temporal sharing

S Choi, S Lee, Y Kim, J Park, Y Kwon… - 2022 USENIX Annual …, 2022 - usenix.org
As machine learning (ML) techniques are applied to a widening range of applications, high
throughput ML inference serving has become critical for online services. Such ML inference …

NetLLM: Adapting large language models for networking

D Wu, X Wang, Y Qiao, Z Wang, J Jiang, S Cui… - Proceedings of the …, 2024 - dl.acm.org
Many networking tasks now employ deep learning (DL) to solve complex prediction and
optimization problems. However, current design philosophy of DL-based algorithms entails …