Loongserve: Efficiently serving long-context large language models with elastic sequence parallelism

B Wu, S Liu, Y Zhong, P Sun, X Liu, X Jin - Proceedings of the ACM …, 2024 - dl.acm.org
The context window of large language models (LLMs) is rapidly increasing, leading to a
huge variance in resource usage between different requests as well as between different …

Rethinking cloud abstractions for tenant-provider cooperative optimization of AI workloads

M Canini, R Bianchini, Í Goiri, D Kostić… - arXiv preprint arXiv …, 2025 - arxiv.org
AI workloads, often hosted in multi-tenant cloud environments, require vast computational
resources but suffer inefficiencies due to limited tenant-provider coordination. Tenants lack …

SYMPHONY: Improving Memory Management for LLM Inference Workloads

S Agarwal, A Mao, A Akella… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) are increasingly being deployed in applications such as
chatbots, code editors, and conversational agents. A key feature of LLMs is their ability to …

Towards Efficient Large Multimodal Model Serving

H Qiu, A Biswas, Z Zhao, J Mohan, A Khare… - arXiv preprint arXiv …, 2025 - arxiv.org
Recent advances in generative AI have led to large multi-modal models (LMMs) capable of
simultaneously processing inputs of various modalities such as text, images, video, and …

TimelyLLM: Segmented LLM Serving System for Time-sensitive Robotic Applications

N Ling, G Chen, L Zhong - arXiv preprint arXiv:2412.18695, 2024 - arxiv.org
Large Language Models (LLMs) such as GPT-4 and Llama3 can already comprehend
complex commands and process diverse tasks. This advancement facilitates their …

Revisiting SLO and Goodput Metrics in LLM Serving

Z Wang, S Li, Y Zhou, X Li, R Gu, N Cam-Tu… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) have achieved remarkable performance and are widely
deployed in various applications, while the serving of LLM inference has raised concerns …