LLM inference serving: Survey of recent advances and opportunities

B Li, Y Jiang, V Gadepally, D Tiwari - arXiv preprint arXiv:2407.12391, 2024 - arxiv.org
This survey offers a comprehensive overview of recent advancements in Large Language
Model (LLM) serving systems, focusing on research since the year 2023. We specifically …

Mélange: Cost efficient large language model serving by exploiting GPU heterogeneity

T Griggs, X Liu, J Yu, D Kim, WL Chiang… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) are increasingly integrated into many online services, yet
they remain cost-prohibitive to deploy due to the requirement of expensive GPU instances …
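
The abstract's premise is that no single GPU type is cheapest across all request sizes and rates, so a deployment can mix instance types instead of defaulting to the most powerful one. A minimal sketch of that per-slice decision follows; the GPU names, prices, and throughputs are invented placeholders, and the paper itself solves a richer allocation problem than this greedy pick:

    import math

    # Hypothetical hourly cost and measured throughput per GPU type for one
    # workload slice (a given request size/rate bucket). Numbers are invented.
    GPUS = {
        "A100": {"cost_per_hr": 3.67, "req_per_hr": 9000},
        "A10G": {"cost_per_hr": 1.01, "req_per_hr": 3200},
        "T4":   {"cost_per_hr": 0.53, "req_per_hr": 1100},
    }

    def cheapest_fleet(target_req_per_hr: float):
        """Pick the GPU type with the lowest cost per request for this slice,
        then size the fleet to meet the target rate."""
        best = min(GPUS, key=lambda g: GPUS[g]["cost_per_hr"] / GPUS[g]["req_per_hr"])
        count = math.ceil(target_req_per_hr / GPUS[best]["req_per_hr"])
        return best, count, round(count * GPUS[best]["cost_per_hr"], 2)

    print(cheapest_fleet(20_000))  # -> ('A10G', 7, 7.07) under these invented numbers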

CacheGen: KV cache compression and streaming for fast large language model serving

Y Liu, H Li, Y Cheng, S Ray, Y Huang… - Proceedings of the …, 2024 - dl.acm.org
As large language models (LLMs) take on complex tasks, their inputs are supplemented with
longer contexts that incorporate domain knowledge. Yet using long contexts is challenging …
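
The mechanism being summarized is shipping a precomputed KV cache over the network instead of recomputing it from the long context, which only pays off if the cache is compressed enough to stream quickly. CacheGen's actual codec exploits the KV cache's structure; the sketch below substitutes a generic per-chunk uint8 quantizer purely to show the compress-and-stream shape:

    import numpy as np

    def stream_kv_cache(kv: np.ndarray, chunk_tokens: int = 256):
        """Generic stand-in for KV-cache compression + streaming (not
        CacheGen's actual codec): quantize the float KV tensor to uint8
        chunk by chunk, yielding each chunk as soon as it is encoded so
        transmission can overlap with encoding."""
        for start in range(0, kv.shape[0], chunk_tokens):
            chunk = kv[start:start + chunk_tokens]
            lo = float(chunk.min())
            scale = (float(chunk.max()) - lo) / 255.0 or 1.0
            quantized = np.round((chunk - lo) / scale).astype(np.uint8)
            yield quantized.tobytes(), (lo, scale)  # metadata needed to dequantize

    # kv laid out as (num_tokens, num_layers * num_heads * head_dim); invented shape.
    for payload, meta in stream_kv_cache(np.random.randn(1024, 4096).astype(np.float32)):
        pass  # send payload + meta over the network here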

Queue management for SLO-oriented large language model serving

A Patke, D Reddy, S Jha, H Qiu, C Pinto… - Proceedings of the …, 2024 - dl.acm.org
Large language model (LLM) serving is becoming an increasingly critical workload for cloud
providers. Existing LLM serving systems focus on interactive requests, such as chatbots and …
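
The distinction the abstract draws is between interactive requests with tight latency SLOs and other workloads (e.g., batch jobs) with looser ones, which a plain first-come-first-served queue cannot tell apart. Below is a toy earliest-deadline-first queue as one way to make that distinction concrete; it is an assumption-level sketch, not the paper's actual policy:

    import heapq
    import time

    class SLOQueue:
        """Toy SLO-aware queue: orders requests by the deadline implied by
        their latency SLO, so loose-SLO batch jobs yield to interactive ones."""
        def __init__(self):
            self._heap = []

        def submit(self, req_id: str, slo_seconds: float) -> None:
            deadline = time.monotonic() + slo_seconds
            heapq.heappush(self._heap, (deadline, req_id))

        def next_request(self) -> str:
            _, req_id = heapq.heappop(self._heap)
            return req_id

    q = SLOQueue()
    q.submit("batch-42", slo_seconds=600.0)  # loose SLO
    q.submit("chat-7", slo_seconds=0.5)      # tight SLO
    print(q.next_request())                  # -> 'chat-7'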

Intelligent router for LLM workloads: Improving performance through workload-aware scheduling

K Jain, A Parayil, A Mallick, E Choukse, X Qin… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Model (LLM) workloads have distinct prefill and decode phases with
different compute and memory requirements, which should ideally be accounted for when …
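
The point is that a prefill-heavy request (long prompt, short output) and a decode-heavy one (short prompt, long output) stress a replica differently, so a round-robin router mixes them badly. A sketch of a workload-aware scoring router; the replica fields and the weights are illustrative assumptions, not the paper's model:

    def pick_replica(replicas: list[dict], prompt_tokens: int) -> dict:
        """Route to the replica where this request's prefill work contends
        least with what is already running. `pending_prefill_tokens` and
        `active_decodes` are assumed to be reported by each replica."""
        def contention(r: dict) -> float:
            prefill_load = r["pending_prefill_tokens"] + prompt_tokens
            decode_load = r["active_decodes"]
            return 0.7 * prefill_load + 0.3 * 100 * decode_load  # invented weights
        return min(replicas, key=contention)

    replicas = [
        {"name": "r0", "pending_prefill_tokens": 8000, "active_decodes": 4},
        {"name": "r1", "pending_prefill_tokens": 500, "active_decodes": 30},
    ]
    print(pick_replica(replicas, prompt_tokens=2000)["name"])  # -> 'r1'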

SocialMind: LLM-based Proactive AR Social Assistive System with Human-like Perception for In-situ Live Interactions

B Yang, Y Guo, L Xu, Z Yan, H Chen, G Xing… - arXiv preprint arXiv …, 2024 - arxiv.org
Social interactions are fundamental to human life. The recent emergence of large
language model (LLM)-based virtual assistants has demonstrated their potential to revolutionize …

Efficient LLM Scheduling by Learning to Rank

Y Fu, S Zhu, R Su, A Qiao, I Stoica, H Zhang - arXiv preprint arXiv …, 2024 - arxiv.org
In Large Language Model (LLM) inference, the output length of an LLM request is typically
regarded as not known a priori. Consequently, most LLM serving systems employ a simple …
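
Because output lengths are unknown, most servers fall back to first-come-first-served, which lets long requests block short ones. The paper's idea is that a model only needs to learn the relative order of output lengths, not their exact values, to approximate shortest-job-first. A sketch, where `predict_rank` stands in for such a learned ranker:

    def rank_schedule(waiting: list[dict], predict_rank) -> list[dict]:
        """Order waiting requests by a learned ranking score of their likely
        output length (lower score = likely shorter), approximating
        shortest-job-first without exact length predictions. `predict_rank`
        is an assumed model interface."""
        return sorted(waiting, key=lambda req: predict_rank(req["prompt"]))

    # Toy ranker: pretend prompts asking for lists run long. Illustrative only.
    toy_ranker = lambda prompt: 1.0 if "list" in prompt else 0.0
    queue = [{"prompt": "list every US state"}, {"prompt": "define SLO"}]
    print(rank_schedule(queue, toy_ranker)[0]["prompt"])  # -> 'define SLO'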

AdaServe: SLO-Customized LLM Serving with Fine-Grained Speculative Decoding

Z Li, Z Chen, R Delacourt, G Oliaro, Z Wang… - arXiv preprint arXiv …, 2025 - arxiv.org
This paper introduces AdaServe, the first LLM serving system to support SLO customization
through fine-grained speculative decoding. AdaServe leverages the logits of a draft model to …
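
The snippet says AdaServe reads the draft model's logits to control speculation at a fine grain; the intuition is that a confident draft is worth extending (more tokens verified per target-model step), while a hesitant draft is likely to be rejected at verification. A sketch of one such policy; the thresholds and interface are invented for illustration, not AdaServe's algorithm:

    import math

    def speculation_length(draft_logprobs: list[float], tight_slo: bool) -> int:
        """Decide how many draft tokens to submit for verification: extend
        the draft while its top-token probability stays high, and allow a
        longer budget when the request's per-token latency SLO is tight."""
        max_len = 8 if tight_slo else 3       # invented budgets
        length = 0
        for lp in draft_logprobs[:max_len]:
            if math.exp(lp) < 0.5:            # draft no longer confident; stop
                break
            length += 1
        return length

    print(speculation_length([-0.1, -0.2, -1.5, -0.1], tight_slo=True))  # -> 2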

TETRIS: Optimal Draft Token Selection for Batch Speculative Decoding

Z Wu, Z Zhou, A Verma, A Prakash, D Rus… - arXiv preprint arXiv …, 2025 - arxiv.org
We propose TETRIS, a novel method that optimizes the total throughput of batch speculative
decoding in multi-request settings. Unlike existing methods that optimize for a single request …
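
In batch speculative decoding the verifier's capacity is shared: every draft token submitted for verification spends part of a fixed batch budget, so tokens unlikely to be accepted waste it. A greedy sketch of budget-aware selection that keeps, across all requests, the prefixes with the highest cumulative draft probability; illustrative only, not the paper's exact algorithm:

    import heapq

    def select_draft_tokens(draft_probs: list[list[float]], budget: int) -> list[int]:
        """Given each request's draft-token probabilities in sequence order,
        choose how many leading tokens per request to submit under a shared
        budget. A token only helps if its prefix is also submitted, so
        candidates are extended one position at a time, ranked by cumulative
        probability (a proxy for whole-prefix acceptance)."""
        # max-heap via negated cumulative probability
        heap = [(-probs[0], i, 0) for i, probs in enumerate(draft_probs) if probs]
        heapq.heapify(heap)
        keep = [0] * len(draft_probs)
        while heap and budget > 0:
            neg_cum, i, pos = heapq.heappop(heap)
            keep[i] = pos + 1
            budget -= 1
            if pos + 1 < len(draft_probs[i]):
                heapq.heappush(heap, (neg_cum * draft_probs[i][pos + 1], i, pos + 1))
        return keep  # draft tokens to submit per request

    # Request 0 has a confident draft, request 1 does not; the budget favors request 0.
    print(select_draft_tokens([[0.9, 0.8, 0.7], [0.4, 0.3]], budget=3))  # -> [3, 0]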

Unveiling Environmental Impacts of Large Language Model Serving: A Functional Unit View

Y Wu, I Hua, Y Ding - arXiv preprint arXiv:2502.11256, 2025 - arxiv.org
Large language models (LLMs) offer powerful capabilities but come with significant
environmental costs, particularly in carbon emissions. Existing studies benchmark these …
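
A functional unit normalizes environmental cost by useful output, e.g., grams of CO2 per million generated tokens rather than per GPU-hour, which makes different hardware and serving configurations comparable. A back-of-the-envelope sketch of that conversion; all inputs are invented, and the paper's methodology is richer than this operational-energy-only estimate:

    def gco2_per_million_tokens(power_watts: float, tokens_per_sec: float,
                                grid_gco2_per_kwh: float) -> float:
        """Operational carbon per functional unit (one million generated
        tokens): energy per token times grid carbon intensity. Ignores
        embodied carbon and idle power."""
        kwh_per_token = (power_watts / 1000.0) / 3600.0 / tokens_per_sec
        return kwh_per_token * grid_gco2_per_kwh * 1_000_000

    # e.g. a 700 W accelerator sustaining 2500 tok/s on a 400 gCO2/kWh grid:
    print(round(gco2_per_million_tokens(700, 2500, 400), 1))  # -> 31.1 g per M tokens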