Deep learning workload scheduling in GPU datacenters: A survey

Z Ye, W Gao, Q Hu, P Sun, X Wang, Y Luo… - ACM Computing …, 2024 - dl.acm.org
Deep learning (DL) has demonstrated its remarkable success in a wide variety of fields. The
development of a DL model is a time-consuming and resource-intensive procedure. Hence …

Efficient memory management for large language model serving with PagedAttention

W Kwon, Z Li, S Zhuang, Y Sheng, L Zheng… - Proceedings of the 29th …, 2023 - dl.acm.org
High throughput serving of large language models (LLMs) requires batching sufficiently
many requests at a time. However, existing systems struggle because the key-value cache …
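
To make the idea named in the title above concrete: PagedAttention manages the key-value (KV) cache in fixed-size blocks, analogous to paging in operating systems, so memory is allocated on demand rather than reserved per request. The Python sketch below is only an illustrative model of such block-table bookkeeping with assumed names and sizes (BLOCK_SIZE, PagedKVCache), not the paper's implementation.

# Illustrative sketch of block-based KV-cache bookkeeping (hypothetical names,
# not the paper's code): each sequence owns a block table mapping its logical
# KV blocks to physical blocks drawn from a shared free pool.
BLOCK_SIZE = 16  # tokens per block (assumed)

class PagedKVCache:
    def __init__(self, num_physical_blocks):
        self.free_blocks = list(range(num_physical_blocks))
        self.block_tables = {}  # sequence id -> list of physical block ids

    def append_token(self, seq_id, num_tokens_so_far):
        # Allocate a new physical block only when the sequence crosses a block boundary.
        table = self.block_tables.setdefault(seq_id, [])
        if num_tokens_so_far % BLOCK_SIZE == 0:
            if not self.free_blocks:
                raise MemoryError("no free KV blocks: preempt or swap a request")
            table.append(self.free_blocks.pop())
        # Return where the new token's key/value vectors would be stored.
        return table[-1], num_tokens_so_far % BLOCK_SIZE

    def free(self, seq_id):
        # A finished request returns all of its blocks to the shared pool.
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

cache = PagedKVCache(num_physical_blocks=4)
for t in range(20):
    block_id, slot = cache.append_token("req-0", t)
cache.free("req-0")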

AlpaServe: Statistical multiplexing with model parallelism for deep learning serving

Z Li, L Zheng, Y Zhong, V Liu, Y Sheng, X Jin… - … USENIX Symposium on …, 2023 - usenix.org
Model parallelism is conventionally viewed as a method to scale a single large deep
learning model beyond the memory limits of a single device. In this paper, we demonstrate …
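
As background for the entry above: model parallelism shards a single model's weights across devices so that a model too large for one GPU's memory can still be served; AlpaServe's point is to additionally use such sharded placements to statistically multiplex bursty request streams across models. The numpy sketch below only illustrates basic column-wise sharding of one linear layer across two hypothetical devices, not AlpaServe's placement or scheduling algorithms.

# Illustrative tensor-parallel linear layer (assumed setup, not AlpaServe's code):
# two numpy arrays stand in for two GPUs, each holding half of the weight columns.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 512))        # a batch of input activations
W = rng.standard_normal((512, 1024))     # full weight, assumed too large for one device

W_dev0, W_dev1 = np.split(W, 2, axis=1)  # each device stores one shard
y_dev0 = x @ W_dev0                      # partial output computed on device 0
y_dev1 = x @ W_dev1                      # partial output computed on device 1

# Gathering the shards reproduces the output of the unsharded layer.
y = np.concatenate([y_dev0, y_dev1], axis=1)
assert np.allclose(y, x @ W)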

InfiniGen: Efficient generative inference of large language models with dynamic KV cache management

W Lee, J Lee, J Seo, J Sim - 18th USENIX Symposium on Operating …, 2024 - usenix.org
Transformer-based large language models (LLMs) demonstrate impressive performance
across various natural language processing tasks. Serving LLM inference for generating …

Llumnix: Dynamic scheduling for large language model serving

B Sun, Z Huang, H Zhao, W Xiao, X Zhang… - … USENIX Symposium on …, 2024 - usenix.org
Inference serving for large language models (LLMs) is the key to unleashing their potential
in people's daily lives. However, efficient LLM serving remains challenging today because …

SpotServe: Serving generative large language models on preemptible instances

X Miao, C Shi, J Duan, X Xi, D Lin, B Cui… - Proceedings of the 29th …, 2024 - dl.acm.org
The high computational and memory requirements of generative large language models
(LLMs) make it challenging to serve them cheaply. This paper aims to reduce the monetary …

ServerlessLLM: Low-latency serverless inference for large language models

Y Fu, L Xue, Y Huang, AO Brabete, D Ustiugov… - … USENIX Symposium on …, 2024 - usenix.org
This paper presents ServerlessLLM, a distributed system designed to support low-latency
serverless inference for Large Language Models (LLMs). By harnessing the substantial near …

Fast distributed inference serving for large language models

B Wu, Y Zhong, Z Zhang, S Liu, F Liu, Y Sun… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language models (LLMs) power a new generation of interactive AI applications
exemplified by ChatGPT. The interactive nature of these applications demands low latency …

Characterization of large language model development in the datacenter

Q Hu, Z Ye, Z Wang, G Wang, M Zhang… - … USENIX Symposium on …, 2024 - usenix.org
Large Language Models (LLMs) have presented impressive performance across several
transformative tasks. However, it is non-trivial to efficiently utilize large-scale cluster …

dLoRA: Dynamically orchestrating requests and adapters for LoRA LLM serving

B Wu, R Zhu, Z Zhang, P Sun, X Liu, X Jin - 18th USENIX Symposium on …, 2024 - usenix.org
Low-rank adaptation (LoRA) is a popular approach to finetune pre-trained large language
models (LLMs) to specific domains. This paper introduces dLoRA, an inference serving …
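
For context on the technique named in this entry: LoRA freezes the pretrained weight matrix W and learns a low-rank update B A (rank r much smaller than the layer dimensions), so a domain-specific adapter is just the small A and B matrices; this is what makes it feasible for a serving system such as dLoRA to keep many adapters resident and route requests among them. The sketch below is a generic numpy illustration of a LoRA forward pass with assumed shapes, not dLoRA's serving logic.

# Generic LoRA forward pass (illustrative, assumed shapes; not dLoRA's code):
# y = x W^T + (alpha / r) * x A^T B^T, with W frozen and only A, B adapted.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 512, 512, 8, 16      # assumed sizes; r << d_in, d_out

W = rng.standard_normal((d_out, d_in))       # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01    # adapter down-projection
B = np.zeros((d_out, r))                     # adapter up-projection, zero-initialized

def lora_forward(x, W, A, B, alpha, r):
    # Base projection plus the scaled low-rank adapter update.
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

x = rng.standard_normal((4, d_in))
y = lora_forward(x, W, A, B, alpha, r)       # equals x @ W.T while B is all zeros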