A comprehensive survey on coded distributed computing: Fundamentals, challenges, and networking applications
Distributed computing has become a common approach for large-scale computation tasks
due to benefits such as high reliability, scalability, computation speed, and cost …
Task scheduling approaches in fog computing: A systematic review
MR Alizadeh, V Khajehvand… - International Journal …, 2020 - Wiley Online Library
Summary: The Internet of Things (IoT) interconnects billions of physical objects to collect and
exchange information and makes available various applications. Despite all the advantages …
Fairness in serving large language models
High-demand LLM inference services (e.g., ChatGPT and BARD) support a wide range of
requests from short chat conversations to long document reading. To ensure that all client …
Learning scheduling algorithms for data processing clusters
Efficiently scheduling data processing jobs on distributed compute clusters requires complex
algorithms. Current systems use simple, generalized heuristics and ignore workload …
Gandiva: Introspective cluster scheduling for deep learning
We introduce Gandiva, a new cluster scheduling framework that utilizes domain-specific
knowledge to improve latency and efficiency of training deep learning models in a GPU …
Beyond data and model parallelism for deep neural networks
Existing deep learning systems commonly parallelize deep neural network (DNN) training
using data or model parallelism, but these strategies often result in suboptimal …
Resource management with deep reinforcement learning
Resource management problems in systems and networking often manifest as difficult
online decision making tasks where appropriate solutions depend on understanding the …
Analysis of Large-Scale Multi-Tenant GPU clusters for DNN training workloads
With widespread advances in machine learning, a number of large enterprises are
beginning to incorporate machine learning models across a number of products. These …
Occupy the cloud: Distributed computing for the 99%
Distributed computing remains inaccessible to a large number of users, in spite of many
open source platforms and extensive commercial offerings. While distributed computation …
Large-scale cluster management at Google with Borg
Google's Borg system is a cluster manager that runs hundreds of thousands of jobs, from
many thousands of different applications, across a number of clusters each with up to tens of …