A comprehensive survey on coded distributed computing: Fundamentals, challenges, and networking applications

JS Ng, WYB Lim, NC Luong, Z Xiong… - … Surveys & Tutorials, 2021 - ieeexplore.ieee.org
Distributed computing has become a common approach for large-scale computation tasks
due to benefits such as high reliability, scalability, computation speed, and cost …

Task scheduling approaches in fog computing: A systematic review

MR Alizadeh, V Khajehvand… - International Journal …, 2020 - Wiley Online Library
The Internet of Things (IoT) interconnects billions of physical objects to collect and
exchange information and makes available various applications. Despite all the advantages …

Fairness in serving large language models

Y Sheng, S Cao, D Li, B Zhu, Z Li, D Zhuo… - … USENIX Symposium on …, 2024 - usenix.org
High-demand LLM inference services (e.g., ChatGPT and BARD) support a wide range of
requests from short chat conversations to long document reading. To ensure that all client …

Learning scheduling algorithms for data processing clusters

H Mao, M Schwarzkopf, SB Venkatakrishnan… - Proceedings of the …, 2019 - dl.acm.org
Efficiently scheduling data processing jobs on distributed compute clusters requires complex
algorithms. Current systems use simple, generalized heuristics and ignore workload …

Gandiva: Introspective cluster scheduling for deep learning

W Xiao, R Bhardwaj, R Ramjee, M Sivathanu… - … USENIX Symposium on …, 2018 - usenix.org
We introduce Gandiva, a new cluster scheduling framework that utilizes domain-specific
knowledge to improve latency and efficiency of training deep learning models in a GPU …

Beyond data and model parallelism for deep neural networks

Z Jia, M Zaharia, A Aiken - Proceedings of Machine Learning …, 2019 - proceedings.mlsys.org
Existing deep learning systems commonly parallelize deep neural network (DNN) training
using data or model parallelism, but these strategies often result in suboptimal …

Resource management with deep reinforcement learning

H Mao, M Alizadeh, I Menache, S Kandula - Proceedings of the 15th …, 2016 - dl.acm.org
Resource management problems in systems and networking often manifest as difficult
online decision making tasks where appropriate solutions depend on understanding the …

Analysis of Large-Scale Multi-Tenant GPU clusters for DNN training workloads

M Jeon, S Venkataraman, A Phanishayee… - 2019 USENIX Annual …, 2019 - usenix.org
With widespread advances in machine learning, a number of large enterprises are
beginning to incorporate machine learning models across a number of products. These …

Occupy the cloud: Distributed computing for the 99%

E Jonas, Q Pu, S Venkataraman, I Stoica… - Proceedings of the 2017 …, 2017 - dl.acm.org
Distributed computing remains inaccessible to a large number of users, in spite of many
open source platforms and extensive commercial offerings. While distributed computation …

Large-scale cluster management at Google with Borg

A Verma, L Pedrosa, M Korupolu… - Proceedings of the …, 2015 - dl.acm.org
Google's Borg system is a cluster manager that runs hundreds of thousands of jobs, from
many thousands of different applications, across a number of clusters each with up to tens of …