Loongserve: Efficiently serving long-context large language models with elastic sequence parallelism

B Wu, S Liu, Y Zhong, P Sun, X Liu, X Jin - Proceedings of the ACM …, 2024 - dl.acm.org
The context window of large language models (LLMs) is rapidly increasing, leading to a
huge variance in resource usage between different requests as well as between different …

Rethinking cloud abstractions for tenant-provider cooperative optimization of AI workloads

M Canini, R Bianchini, Í Goiri, D Kostić… - arXiv preprint arXiv …, 2025 - arxiv.org
AI workloads, often hosted in multi-tenant cloud environments, require vast computational
resources but suffer inefficiencies due to limited tenant-provider coordination. Tenants lack …

SYMPHONY: Improving Memory Management for LLM Inference Workloads

S Agarwal, A Mao, A Akella… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) are increasingly being deployed in applications such as
chatbots, code editors, and conversational agents. A key feature of LLMs is their ability to …

Towards Efficient Large Multimodal Model Serving

H Qiu, A Biswas, Z Zhao, J Mohan, A Khare… - arXiv preprint arXiv …, 2025 - arxiv.org
Recent advances in generative AI have led to large multi-modal models (LMMs) capable of
simultaneously processing inputs of various modalities such as text, images, video, and …

TimelyLLM: Segmented LLM Serving System for Time-sensitive Robotic Applications

N Ling, G Chen, L Zhong - arXiv preprint arXiv:2412.18695, 2024 - arxiv.org
Large Language Models (LLMs) such as GPT-4 and Llama3 can already comprehend
complex commands and process diverse tasks. This advancement facilitates their …

Revisiting SLO and Goodput Metrics in LLM Serving

Z Wang, S Li, Y Zhou, X Li, R Gu, N Cam-Tu… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) have achieved remarkable performance and are widely
deployed in various applications, while the serving of LLM inference has raised concerns …