Scalable deep learning on distributed infrastructures: Challenges, techniques, and tools
Deep Learning (DL) has had an immense success in the recent past, leading to state-of-the-
art results in various domains, such as image recognition and natural language processing …
Machine learning methods for reliable resource provisioning in edge-cloud computing: A survey
Large-scale software systems are currently designed as distributed entities and deployed in
cloud data centers. To overcome the limitations inherent to this type of deployment …
Pond: CXL-based memory pooling systems for cloud platforms
Public cloud providers seek to meet stringent performance requirements and low hardware
cost. A key driver of performance and cost is main memory. Memory pooling promises to …
Learning scheduling algorithms for data processing clusters
Efficiently scheduling data processing jobs on distributed compute clusters requires complex
algorithms. Current systems use simple, generalized heuristics and ignore workload …
Autopilot: workload autoscaling at Google
K Rzadca, P Findeisen, J Swiderski, P Zych… - Proceedings of the …, 2020 - dl.acm.org
In many public and private Cloud systems, users need to specify a limit for the amount of
resources (CPU cores and RAM) to provision for their workloads. A job that exceeds its limits …
Resource management with deep reinforcement learning
Resource management problems in systems and networking often manifest as difficult
online decision making tasks where appropriate solutions depend on understanding the …
Optimus: an efficient dynamic resource scheduler for deep learning clusters
Deep learning workloads are common in today's production clusters due to the proliferation
of deep learning driven AI services (e.g., speech recognition, machine translation). A deep …
Learning to perform local rewriting for combinatorial optimization
Search-based methods for hard combinatorial optimization are often guided by heuristics.
Tuning heuristics in various conditions and situations is often time-consuming. In this paper …
Serving heterogeneous machine learning models on Multi-GPU servers with Spatio-Temporal sharing
As machine learning (ML) techniques are applied to a widening range of applications, high
throughput ML inference serving has become critical for online services. Such ML inference …
NetLLM: Adapting large language models for networking
Many networking tasks now employ deep learning (DL) to solve complex prediction and
optimization problems. However, current design philosophy of DL-based algorithms entails …