LLM inference serving: Survey of recent advances and opportunities

B Li, Y Jiang, V Gadepally, D Tiwari - arXiv preprint arXiv:2407.12391, 2024 - arxiv.org
This survey offers a comprehensive overview of recent advancements in Large Language
Model (LLM) serving systems, focusing on research since the year 2023. We specifically …

A survey on efficient inference for large language models

Z Zhou, X Ning, K Hong, T Fu, J Xu, S Li, Y Lou… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) have attracted extensive attention due to their remarkable
performance across various tasks. However, the substantial computational and memory …

Fast distributed inference serving for large language models

B Wu, Y Zhong, Z Zhang, S Liu, F Liu, Y Sun… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language models (LLMs) power a new generation of interactive AI applications
exemplified by ChatGPT. The interactive nature of these applications demands low latency …

Empowering 1000 tokens/second on-device LLM prefilling with mllm-NPU

D Xu, H Zhang, L Yang, R Liu, G Huang, M Xu… - arXiv preprint arXiv …, 2024 - arxiv.org
On-device large language models (LLMs) are catalyzing novel mobile applications such as
UI task automation and personalized email auto-reply, without giving away users' private …

MemServe: Context caching for disaggregated LLM serving with elastic memory pool

C Hu, H Huang, J Hu, J Xu, X Chen, T Xie… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language model (LLM) serving has transformed from stateless to stateful systems,
utilizing techniques like context caching and disaggregated inference. These optimizations …

CacheGen: KV cache compression and streaming for fast large language model serving

Y Liu, H Li, Y Cheng, S Ray, Y Huang… - Proceedings of the …, 2024 - dl.acm.org
As large language models (LLMs) take on complex tasks, their inputs are supplemented with
longer contexts that incorporate domain knowledge. Yet using long contexts is challenging …

Andes: Defining and enhancing quality-of-experience in LLM-based text streaming services

J Liu, JW Chung, Z Wu, F Lai, M Lee… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) are now at the core of conversational AI services such as
real-time translation and chatbots, which provide live user interaction by incrementally …

Shortcut-connected expert parallelism for accelerating mixture-of-experts

W Cai, J Jiang, L Qin, J Cui, S Kim, J Huang - arXiv preprint arXiv …, 2024 - arxiv.org
Expert parallelism has been introduced as a strategy to distribute the computational
workload of sparsely-gated mixture-of-experts (MoE) models across multiple computing …

Mnemosyne: Parallelization strategies for efficiently serving multi-million context length LLM inference requests without approximations

A Agrawal, J Chen, Í Goiri, R Ramjee, C Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
As large language models (LLMs) evolve to handle increasingly longer contexts, serving
inference requests for context lengths in the range of millions of tokens presents unique …

SOFA: A compute-memory optimized sparsity accelerator via cross-stage coordinated tiling

H Wang, J Fang, X Tang, Z Yue, J Li… - 2024 57th IEEE/ACM …, 2024 - ieeexplore.ieee.org
Benefiting from the self-attention mechanism, Transformer models have attained impressive
contextual comprehension capabilities for lengthy texts. The requirements of high …