Deep learning workload scheduling in GPU datacenters: A survey
Deep learning (DL) has demonstrated its remarkable success in a wide variety of fields. The
development of a DL model is a time-consuming and resource-intensive procedure. Hence …
AlpaServe: Statistical multiplexing with model parallelism for deep learning serving
Model parallelism is conventionally viewed as a method to scale a single large deep
learning model beyond the memory limits of a single device. In this paper, we demonstrate …
Fast distributed inference serving for large language models
Large language models (LLMs) power a new generation of interactive AI applications
exemplified by ChatGPT. The interactive nature of these applications demands low latency …
Towards efficient generative large language model serving: A survey from algorithms to systems
In the rapidly evolving landscape of artificial intelligence (AI), generative large language
models (LLMs) stand at the forefront, revolutionizing how we interact with our data. However …
Microsecond-scale preemption for concurrent GPU-accelerated DNN inferences
Many intelligent applications like autonomous driving and virtual reality require running both
latency-critical and best-effort DNN inference tasks to achieve both real time and work …
Defending batch-level label inference and replacement attacks in vertical federated learning
In a vertical federated learning (VFL) scenario where features and models are split into
different parties, it has been shown that sample-level gradient information can be exploited …
Power-aware deep learning model serving with μ-Serve
With the increasing popularity of large deep learning model-serving workloads, there is a
pressing need to reduce the energy consumption of a model-serving cluster while …
Liquid: Intelligent resource estimation and network-efficient scheduling for deep learning jobs on distributed GPU clusters
Deep learning (DL) is becoming increasingly popular in many domains, including computer
vision, speech recognition, self-driving automobiles, etc. GPU can train DL models efficiently …
Bamboo: Making preemptible instances resilient for affordable training of large DNNs
DNN models across many domains continue to grow in size, resulting in high resource
requirements for effective training, and unpalatable (and often unaffordable) costs for …
Transparent GPU sharing in container clouds for deep learning workloads
Containers are widely used for resource management in datacenters. A common practice to
support deep learning (DL) training in container clouds is to statically bind GPUs to …