LLM inference serving: Survey of recent advances and opportunities

B Li, Y Jiang, V Gadepally, D Tiwari - arXiv preprint arXiv:2407.12391, 2024 - arxiv.org
This survey offers a comprehensive overview of recent advancements in Large Language
Model (LLM) serving systems, focusing on research since the year 2023. We specifically …

LLM inference unveiled: Survey and roofline model insights

Z Yuan, Y Shang, Y Zhou, Z Dong, Z Zhou… - arXiv preprint arXiv …, 2024 - arxiv.org
The field of efficient Large Language Model (LLM) inference is rapidly evolving, presenting a
unique blend of opportunities and challenges. Although the field has expanded and is …

LoongServe: Efficiently serving long-context large language models with elastic sequence parallelism

B Wu, S Liu, Y Zhong, P Sun, X Liu, X Jin - Proceedings of the ACM …, 2024 - dl.acm.org
The context window of large language models (LLMs) is rapidly increasing, leading to a
huge variance in resource usage between different requests as well as between different …

A survey of low-bit large language models: Basics, systems, and algorithms

R Gong, Y Ding, Z Wang, C Lv, X Zheng, J Du… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) have achieved remarkable advancements in natural
language processing, showcasing exceptional performance across various tasks. However …

USHER: Holistic Interference Avoidance for Resource Optimized ML Inference

SS Shubha, H Shen, A Iyer - 18th USENIX Symposium on Operating …, 2024 - usenix.org
Minimizing monetary cost and maximizing the goodput of inference serving systems are
increasingly important as deep learning models grow ever more popular. While it …

NanoFlow: Towards optimal large language model serving throughput

K Zhu, Y Zhao, L Zhao, G Zuo, Y Gu, D Xie… - arXiv preprint arXiv …, 2024 - arxiv.org
The increasing usage of Large Language Models (LLMs) has resulted in a surging demand
for planet-scale serving systems, where tens of thousands of GPUs continuously serve …

InstInfer: In-storage attention offloading for cost-effective long-context LLM inference

X Pan, E Li, Q Li, S Liang, Y Shan, K Zhou… - arXiv preprint arXiv …, 2024 - arxiv.org
The widespread adoption of Large Language Models (LLMs) marks a significant milestone
in generative AI. Nevertheless, the increasing context length and batch size in offline LLM …

A survey on efficient inference for large language models

Z Zhou, X Ning, K Hong, T Fu, J Xu, S Li, Y Lou… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) have attracted extensive attention due to their remarkable
performance across various tasks. However, the substantial computational and memory …

Stateful large language model serving with Pensieve

L Yu, J Lin, J Li - arXiv preprint arXiv:2312.05516, 2023 - arxiv.org
Large Language Models (LLMs) are wildly popular today, and it is important to serve them
efficiently. Existing LLM serving systems are stateless across requests. Consequently, when …

Fast state restoration in LLM serving with HCache

S Gao, Y Chen, J Shu - arXiv preprint arXiv:2410.05004, 2024 - arxiv.org
The growing complexity of LLM usage today, e.g., multi-round conversation and retrieval-
augmented generation (RAG), makes contextual states (i.e., the KV cache) reusable across user …