Load-balancing algorithms in cloud computing: A survey

EJ Ghomi, AM Rahmani, NN Qader - Journal of Network and Computer …, 2017 - Elsevier
Cloud computing is a modern paradigm to provide services through the Internet. Load
balancing is a key aspect of cloud computing and avoids the situation in which some nodes …

Resource management in clouds: Survey and research challenges

B Jennings, R Stadler - Journal of Network and Systems Management, 2015 - Springer
Resource management in a cloud environment is a hard problem, due to: the scale of
modern data centers; the heterogeneity of resource types and their interdependencies; the …

Parrot: Efficient Serving of LLM-based Applications with Semantic Variable

C Lin, Z Han, C Zhang, Y Yang, F Yang… - … USENIX Symposium on …, 2024 - usenix.org
The rise of large language models (LLMs) has enabled LLM-based applications (aka AI
agents or co-pilots), a new software paradigm that combines the strength of LLM and …

Tiresias: A GPU cluster manager for distributed deep learning

J Gu, M Chowdhury, KG Shin, Y Zhu, M Jeon… - … USENIX Symposium on …, 2019 - usenix.org
Deep learning (DL) training jobs bring some unique challenges to existing cluster
managers, such as unpredictable training times, an all-or-nothing execution model, and …

Gradient coding: Avoiding stragglers in distributed learning

R Tandon, Q Lei, AG Dimakis… - … on Machine Learning, 2017 - proceedings.mlr.press
We propose a novel coding theoretic framework for mitigating stragglers in distributed
learning. We show how carefully replicating data blocks and coding across gradients can …

Polynomial codes: an optimal design for high-dimensional coded matrix multiplication

Q Yu, M Maddah-Ali… - Advances in Neural …, 2017 - proceedings.neurips.cc
We consider a large-scale matrix multiplication problem where the computation is carried
out using a distributed system with a master node and multiple worker nodes, where each …

Speeding up distributed machine learning using codes

K Lee, M Lam, R Pedarsani… - IEEE Transactions …, 2017 - ieeexplore.ieee.org
Codes are widely used in many engineering applications to offer robustness against noise.
In large-scale systems, there are several types of noise that can affect the performance of …

Social big data: Recent achievements and new challenges

G Bello-Orgaz, JJ Jung, D Camacho - Information Fusion, 2016 - Elsevier
Big data has become an important issue for a large number of research areas such as data
mining, machine learning, computational intelligence, information fusion, the semantic Web …

Shuffling, fast and slow: Scalable analytics on serverless infrastructure

Q Pu, S Venkataraman, I Stoica - 16th USENIX symposium on networked …, 2019 - usenix.org
Serverless computing is poised to fulfill the long-held promise of transparent elasticity and
millisecond-level pricing. To achieve this goal, service providers impose a fine-grained …

Straggler mitigation in distributed matrix multiplication: Fundamental limits and optimal coding

Q Yu, MA Maddah-Ali… - IEEE Transactions on …, 2020 - ieeexplore.ieee.org
We consider the problem of massive matrix multiplication, which underlies many data
analytic applications, in a large-scale distributed system comprising a group of worker …