Oort: Efficient federated learning via guided participant selection
Federated Learning (FL) is an emerging direction in distributed machine learning (ML) that
enables in-situ model training and testing on edge data. Despite having the same end goals …
MLaaS in the wild: Workload analysis and scheduling in large-scale heterogeneous GPU clusters
With the sustained technological advances in machine learning (ML) and the availability of
massive datasets recently, tech companies are deploying large ML-as-a-Service (MLaaS) …
Deep learning workload scheduling in GPU datacenters: A survey
Deep learning (DL) has demonstrated its remarkable success in a wide variety of fields. The
development of a DL model is a time-consuming and resource-intensive procedure. Hence …
A unified architecture for accelerating distributed DNN training in heterogeneous GPU/CPU clusters
Data center clusters that run DNN training jobs are inherently heterogeneous. They have
GPUs and CPUs for computation and network bandwidth for distributed training. However …
A survey on scheduling techniques in computing and network convergence
S Tang, Y Yu, H Wang, G Wang, W Chen… - … Surveys & Tutorials, 2023 - ieeexplore.ieee.org
The computing demand for massive applications has led to the ubiquitous deployment of
computing power. This trend results in the urgent need for higher-level computing resource …
INFaaS: Automated model-less inference serving
Despite existing work in machine learning inference serving, ease-of-use and cost efficiency
remain challenges at large scales. Developers must manually search through thousands of …
Heterogeneity-aware cluster scheduling policies for deep learning workloads
Specialized accelerators such as GPUs, TPUs, FPGAs, and custom ASICs have been
increasingly deployed to train deep learning models. These accelerators exhibit …
Fast distributed inference serving for large language models
Large language models (LLMs) power a new generation of interactive AI applications
exemplified by ChatGPT. The interactive nature of these applications demands low latency …
Analysis of large-scale multi-tenant GPU clusters for DNN training workloads
With widespread advances in machine learning, a number of large enterprises are
beginning to incorporate machine learning models across a number of products. These …
A generic communication scheduler for distributed DNN training acceleration
We present ByteScheduler, a generic communication scheduler for distributed DNN training
acceleration. ByteScheduler is based on our principled analysis that partitioning and …