RetrievalAttention: Accelerating long-context LLM inference via vector retrieval

D Liu, M Chen, B Lu, H Jiang, Z Han, Q Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Transformer-based Large Language Models (LLMs) have become increasingly important.
However, due to the quadratic time complexity of attention computation, scaling LLMs to …

NEO: Saving GPU memory crisis with CPU offloading for online LLM inference

X Jiang, Y Zhou, S Cao, I Stoica, M Yu - arXiv preprint arXiv:2411.01142, 2024 - arxiv.org
Online LLM inference powers many exciting applications such as intelligent chatbots and
autonomous agents. Modern LLM inference engines widely rely on request batching to …

ExpertFlow: Optimized expert activation and token allocation for efficient mixture-of-experts inference

X He, S Zhang, Y Wang, H Yin, Z Zeng, S Shi… - arXiv preprint arXiv …, 2024 - arxiv.org
Sparse Mixture of Experts (MoE) models, while outperforming dense Large Language
Models (LLMs) in terms of performance, face significant deployment challenges during …

DeepFlow: Serverless Large Language Model Serving at Scale

J Hu, J Xu, Z Liu, Y He, Y Chen, H Xu, J Liu… - arXiv preprint arXiv …, 2025 - arxiv.org
This paper introduces DeepFlow, a scalable and serverless AI platform designed to
efficiently serve large language models (LLMs) at scale in cloud environments. DeepFlow …

iServe: An Intent-based Serving System for LLMs

D Liakopoulos, T Hu, P Sinha… - arXiv preprint arXiv …, 2025 - arxiv.org
Large Language Models (LLMs) are becoming ubiquitous across industries, where
applications demand they fulfill diverse user intents. However, developers currently face the …

KunServe: Elastic and Efficient Large Language Model Serving with Parameter-centric Memory Management

R Cheng, Y Peng, Y Lai, X Wei, R Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
The stateful nature of large language model (LLM) serving can easily throttle precious GPU
memory under load bursts or long-generation requests like chain-of-thought reasoning …

Empowering Large Language Models to Edge Intelligence: A Survey of Edge Efficient LLMs and Techniques

R Wang, Z Gao, L Zhang, S Yue, Z Gao