vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention

R Prabhu, A Nayak, J Mohan, R Ramjee… - arXiv preprint arXiv…, 2024 - arxiv.org
Efficient management of GPU memory is essential for high-throughput LLM inference. Prior
systems reserved KV-cache memory ahead of time, resulting in wasted capacity …
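
The trade-off the snippet points to can be illustrated with a toy sketch (hypothetical model shapes and plain PyTorch tensors, not the paper's actual mechanism): reserving the KV-cache for the maximum context length up front wastes memory whenever a request finishes early, while growing the cache block by block allocates only what decoding actually uses.

```python
import torch

# Hypothetical model dimensions, chosen only for illustration.
MAX_SEQ_LEN = 4096                       # worst-case context length
NUM_LAYERS, NUM_HEADS, HEAD_DIM = 32, 32, 128
BYTES = 2                                # fp16

def static_kv_cache():
    """Ahead-of-time reservation: one K and one V tensor per layer,
    sized for the worst case regardless of the actual output length."""
    return [
        (torch.empty(MAX_SEQ_LEN, NUM_HEADS, HEAD_DIM, dtype=torch.float16),
         torch.empty(MAX_SEQ_LEN, NUM_HEADS, HEAD_DIM, dtype=torch.float16))
        for _ in range(NUM_LAYERS)
    ]

class DynamicKVCache:
    """On-demand growth: extend the cache one fixed-size block at a time
    as tokens are generated, so short requests never pay for MAX_SEQ_LEN."""
    def __init__(self, block_tokens=256):
        self.block_tokens = block_tokens
        self.blocks = [[] for _ in range(NUM_LAYERS)]  # per-layer block lists
        self.used_tokens = 0

    def append_token(self):
        # Allocate a fresh block for every layer when the current one fills up.
        if self.used_tokens % self.block_tokens == 0:
            for layer_blocks in self.blocks:
                layer_blocks.append(
                    (torch.empty(self.block_tokens, NUM_HEADS, HEAD_DIM,
                                 dtype=torch.float16),
                     torch.empty(self.block_tokens, NUM_HEADS, HEAD_DIM,
                                 dtype=torch.float16)))
        self.used_tokens += 1

# A request that stops after 300 tokens: the static cache still holds 4096
# tokens' worth of K/V per layer, the dynamic cache only two 256-token blocks.
cache = DynamicKVCache()
for _ in range(300):
    cache.append_token()

static_bytes = NUM_LAYERS * 2 * MAX_SEQ_LEN * NUM_HEADS * HEAD_DIM * BYTES
dynamic_bytes = (NUM_LAYERS * 2 * len(cache.blocks[0]) * cache.block_tokens
                 * NUM_HEADS * HEAD_DIM * BYTES)
print(f"static: {static_bytes / 2**20:.0f} MiB, "
      f"dynamic: {dynamic_bytes / 2**20:.0f} MiB")
```

With these assumed shapes the static reservation holds 2048 MiB per request while the dynamic cache holds 256 MiB, which is the capacity gap the abstract attributes to ahead-of-time reservation.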

Tackling the Dynamicity in a Production LLM Serving System with SOTA Optimizations via Hybrid Prefill/Decode/Verify Scheduling on Efficient Meta-kernels

M Song, X Tang, F Hou, J Li, W Wei, Y Ma… - arXiv preprint arXiv…, 2024 - arxiv.org
Meeting growing demands for low latency and cost efficiency in production-grade large
language model (LLM) serving systems requires integrating advanced optimization …

Towards Efficient Large Multimodal Model Serving

H Qiu, A Biswas, Z Zhao, J Mohan, A Khare… - arXiv preprint arXiv…, 2025 - arxiv.org
Recent advances in generative AI have led to large multi-modal models (LMMs) capable of
simultaneously processing inputs of various modalities such as text, images, video, and …