vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention

R Prabhu, A Nayak, J Mohan, R Ramjee… - arXiv preprint arXiv…, 2024 - arxiv.org
Efficient management of GPU memory is essential for high-throughput LLM inference. Prior
systems reserved KV-cache memory ahead of time, resulting in wasted capacity …
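
The trade-off the snippet points to can be illustrated with a toy sketch (hypothetical model shapes and plain PyTorch tensors, not the paper's actual mechanism): reserving the KV-cache for the maximum context length up front wastes memory whenever a request finishes early, while growing the cache block by block allocates only what decoding actually uses.

```python
import torch

# Hypothetical model dimensions, chosen only for illustration.
MAX_SEQ_LEN = 4096                       # worst-case context length
NUM_LAYERS, NUM_HEADS, HEAD_DIM = 32, 32, 128
BYTES = 2                                # fp16

def static_kv_cache():
    """Ahead-of-time reservation: one K and one V tensor per layer,
    sized for the worst case regardless of the actual output length."""
    return [
        (torch.empty(MAX_SEQ_LEN, NUM_HEADS, HEAD_DIM, dtype=torch.float16),
         torch.empty(MAX_SEQ_LEN, NUM_HEADS, HEAD_DIM, dtype=torch.float16))
        for _ in range(NUM_LAYERS)
    ]

class DynamicKVCache:
    """On-demand growth: extend the cache one fixed-size block at a time
    as tokens are generated, so short requests never pay for MAX_SEQ_LEN."""
    def __init__(self, block_tokens=256):
        self.block_tokens = block_tokens
        self.blocks = [[] for _ in range(NUM_LAYERS)]  # per-layer block lists
        self.used_tokens = 0

    def append_token(self):
        # Allocate a fresh block for every layer when the current one fills up.
        if self.used_tokens % self.block_tokens == 0:
            for layer_blocks in self.blocks:
                layer_blocks.append(
                    (torch.empty(self.block_tokens, NUM_HEADS, HEAD_DIM,
                                 dtype=torch.float16),
                     torch.empty(self.block_tokens, NUM_HEADS, HEAD_DIM,
                                 dtype=torch.float16)))
        self.used_tokens += 1

# A request that stops after 300 tokens: the static cache still holds 4096
# tokens' worth of K/V per layer, the dynamic cache only two 256-token blocks.
cache = DynamicKVCache()
for _ in range(300):
    cache.append_token()

static_bytes = NUM_LAYERS * 2 * MAX_SEQ_LEN * NUM_HEADS * HEAD_DIM * BYTES
dynamic_bytes = (NUM_LAYERS * 2 * len(cache.blocks[0]) * cache.block_tokens
                 * NUM_HEADS * HEAD_DIM * BYTES)
print(f"static: {static_bytes / 2**20:.0f} MiB, "
      f"dynamic: {dynamic_bytes / 2**20:.0f} MiB")
```

With these assumed shapes the static reservation holds 2048 MiB per request while the dynamic cache holds 256 MiB, which is the capacity gap the abstract attributes to ahead-of-time reservation.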

Tackling the Dynamicity in a Production LLM Serving System with SOTA Optimizations via Hybrid Prefill/Decode/Verify Scheduling on Efficient Meta-kernels

M Song, X Tang, F Hou, J Li, W Wei, Y Ma… - arXiv preprint arXiv…, 2024 - arxiv.org
Meeting growing demands for low latency and cost efficiency in production-grade large
language model (LLM) serving systems requires integrating advanced optimization …

Towards Efficient Large Multimodal Model Serving

H Qiu, A Biswas, Z Zhao, J Mohan, A Khare… - arXiv preprint arXiv…, 2025 - arxiv.org
Recent advances in generative AI have led to large multi-modal models (LMMs) capable of
simultaneously processing inputs of various modalities such as text, images, video, and …