Deep learning workload scheduling in GPU datacenters: A survey

Z Ye, W Gao, Q Hu, P Sun, X Wang, Y Luo… - ACM Computing …, 2024 - dl.acm.org
Deep learning (DL) has demonstrated its remarkable success in a wide variety of fields. The
development of a DL model is a time-consuming and resource-intensive procedure. Hence …

AlpaServe: Statistical multiplexing with model parallelism for deep learning serving

Z Li, L Zheng, Y Zhong, V Liu, Y Sheng, X Jin… - … USENIX Symposium on …, 2023 - usenix.org
Model parallelism is conventionally viewed as a method to scale a single large deep
learning model beyond the memory limits of a single device. In this paper, we demonstrate …
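
A toy Python sketch of the statistical-multiplexing intuition behind this result (not AlpaServe's actual placement algorithm; the model counts, burst pattern, and variable names are all hypothetical): if each bursty model is sharded across every GPU instead of owning a dedicated one, each GPU sees the average of the request streams rather than one model's spikes.

import random
random.seed(0)
NUM_GPUS = NUM_MODELS = 4
STEPS = 10_000

hot_dedicated = hot_shared = 0.0
for _ in range(STEPS):
    # Bursty per-model load: usually idle, occasionally a full-GPU spike.
    loads = [random.choice([0.0, 0.0, 0.0, 1.0]) for _ in range(NUM_MODELS)]
    hot_dedicated += max(loads)          # model i owns GPU i: the hottest GPU carries a full spike
    hot_shared += sum(loads) / NUM_GPUS  # model-parallel sharding: every GPU sees the average load

print(f"avg load on hottest GPU, dedicated:      {hot_dedicated / STEPS:.2f}")  # roughly 0.68
print(f"avg load on hottest GPU, model-parallel: {hot_shared / STEPS:.2f}")     # roughly 0.25

The lower hot-GPU load under sharding is what lets the same fleet absorb bursts with fewer devices.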

Towards efficient generative large language model serving: A survey from algorithms to systems

X Miao, G Oliaro, Z Zhang, X Cheng, H Jin… - arXiv preprint arXiv …, 2023 - arxiv.org
In the rapidly evolving landscape of artificial intelligence (AI), generative large language
models (LLMs) stand at the forefront, revolutionizing how we interact with our data. However …

Microsecond-scale preemption for concurrent GPU-accelerated DNN inferences

M Han, H Zhang, R Chen, H Chen - 16th USENIX Symposium on …, 2022 - usenix.org
Many intelligent applications like autonomous driving and virtual reality require running both
latency-critical and best-effort DNN inference tasks to achieve both real time and work …
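
A minimal two-priority scheduler sketch in Python to illustrate the latency-critical versus best-effort split the abstract describes (hypothetical code; the paper's contribution is preempting kernels on the GPU itself at microsecond scale, which a host-side queue like this cannot do):

import heapq

class TwoPriorityScheduler:
    def __init__(self):
        self._queue = []  # (priority, seq, task); 0 = latency-critical, 1 = best-effort
        self._seq = 0

    def submit(self, task, latency_critical=False):
        heapq.heappush(self._queue, (0 if latency_critical else 1, self._seq, task))
        self._seq += 1

    def run(self):
        while self._queue:
            _, _, task = heapq.heappop(self._queue)
            task()  # best-effort tasks run only when no latency-critical task is waiting

sched = TwoPriorityScheduler()
sched.submit(lambda: print("best-effort analytics inference"))
sched.submit(lambda: print("autonomous-driving inference"), latency_critical=True)
sched.run()  # the latency-critical task runs first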

Fast distributed inference serving for large language models

B Wu, Y Zhong, Z Zhang, S Liu, F Liu, Y Sun… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language models (LLMs) power a new generation of interactive AI applications
exemplified by ChatGPT. The interactive nature of these applications demands low latency …

Beware of fragmentation: Scheduling GPU-sharing workloads with fragmentation gradient descent

Q Weng, L Yang, Y Yu, W Wang, X Tang… - 2023 USENIX Annual …, 2023 - usenix.org
Large tech companies are piling up a massive number of GPUs in their server fleets to run
diverse machine learning (ML) workloads. However, these expensive devices often suffer …
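
A simplified Python sketch of fragmentation-aware placement (the measure below, unusable leftover GPU fractions, and the node names are hypothetical; the paper defines a statistical fragmentation metric and schedules along its steepest-descent direction): each candidate GPU slot is scored by how much the placement would change cluster fragmentation, and the lowest-scoring slot wins.

def fragmentation(free_gpus, typical_request=0.6):
    # Leftover capacity too small for a typical request counts as fragmented.
    return sum(f for f in free_gpus if 0 < f < typical_request)

def place(nodes, demand):
    best, best_delta = None, float("inf")
    for name, free in nodes.items():
        for i, f in enumerate(free):
            if f >= demand:
                after = free[:i] + [f - demand] + free[i + 1:]
                delta = fragmentation(after) - fragmentation(free)
                if delta < best_delta:
                    best, best_delta = (name, i), delta
    return best

nodes = {"node-a": [1.0, 0.6], "node-b": [0.5, 0.5]}
print(place(nodes, 0.5))  # ('node-b', 0): filling a half-free GPU removes fragmentation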

LLMCad: Fast and scalable on-device large language model inference

D Xu, W Yin, X Jin, Y Zhang, S Wei, M Xu… - arXiv preprint arXiv …, 2023 - arxiv.org
Generative tasks, such as text generation and question answering, hold a crucial position in
the realm of mobile applications. Due to their sensitivity to privacy concerns, there is a …

Orion: Interference-aware, fine-grained GPU sharing for ML applications

F Strati, X Ma, A Klimovic - … of the Nineteenth European Conference on …, 2024 - dl.acm.org
GPUs are critical for maximizing the throughput-per-Watt of deep neural network (DNN)
applications. However, DNN applications often underutilize GPUs, even when using large …
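
A hypothetical Python sketch of the interference-aware idea (the bottleneck labels and kernel names are illustrative, not Orion's profiling or scheduling machinery, which operates on CUDA streams): co-schedule a best-effort kernel next to the running high-priority one only when their dominant bottlenecks differ, e.g. compute-bound next to memory-bound.

def can_colocate(high_priority_kernel, best_effort_kernel):
    # Kernels contending for the same resource interfere; mixed profiles share well.
    return high_priority_kernel["bottleneck"] != best_effort_kernel["bottleneck"]

running = {"name": "resnet_conv", "bottleneck": "compute"}
candidates = [
    {"name": "embedding_lookup", "bottleneck": "memory"},
    {"name": "dense_gemm", "bottleneck": "compute"},
]
for kernel in candidates:
    verdict = "co-schedule" if can_colocate(running, kernel) else "defer"
    print(f"{kernel['name']}: {verdict}")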

Power-aware Deep Learning Model Serving with μ-Serve

H Qiu, W Mao, A Patke, S Cui, S Jha, C Wang… - 2024 USENIX Annual …, 2024 - usenix.org
With the increasing popularity of large deep learning model-serving workloads, there is a
pressing need to reduce the energy consumption of a model-serving cluster while …

Transparent GPU sharing in container clouds for deep learning workloads

B Wu, Z Zhang, Z Bai, X Liu, X Jin - 20th USENIX Symposium on …, 2023 - usenix.org
Containers are widely used for resource management in datacenters. A common practice to
support deep learning (DL) training in container clouds is to statically bind GPUs to …