LLM inference serving: Survey of recent advances and opportunities

B Li, Y Jiang, V Gadepally, D Tiwari - arXiv preprint arXiv:2407.12391, 2024 - arxiv.org
This survey offers a comprehensive overview of recent advancements in Large Language
Model (LLM) serving systems, focusing on research since the year 2023. We specifically …

Mélange: Cost efficient large language model serving by exploiting GPU heterogeneity

T Griggs, X Liu, J Yu, D Kim, WL Chiang… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) are increasingly integrated into many online services, yet
they remain cost-prohibitive to deploy due to the requirement of expensive GPU instances …
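
The abstract's premise is that no single GPU type is cheapest across all request sizes and rates, so a deployment can mix instance types instead of defaulting to the most powerful one. A minimal sketch of that per-slice decision follows; the GPU names, prices, and throughputs are invented placeholders, and the paper itself solves a richer allocation problem than this greedy pick:

    import math

    # Hypothetical hourly cost and measured throughput per GPU type for one
    # workload slice (a given request size/rate bucket). Numbers are invented.
    GPUS = {
        "A100": {"cost_per_hr": 3.67, "req_per_hr": 9000},
        "A10G": {"cost_per_hr": 1.01, "req_per_hr": 3200},
        "T4":   {"cost_per_hr": 0.53, "req_per_hr": 1100},
    }

    def cheapest_fleet(target_req_per_hr: float):
        """Pick the GPU type with the lowest cost per request for this slice,
        then size the fleet to meet the target rate."""
        best = min(GPUS, key=lambda g: GPUS[g]["cost_per_hr"] / GPUS[g]["req_per_hr"])
        count = math.ceil(target_req_per_hr / GPUS[best]["req_per_hr"])
        return best, count, round(count * GPUS[best]["cost_per_hr"], 2)

    print(cheapest_fleet(20_000))  # -> ('A10G', 7, 7.07) under these invented numbers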

CacheGen: KV cache compression and streaming for fast large language model serving

Y Liu, H Li, Y Cheng, S Ray, Y Huang… - Proceedings of the …, 2024 - dl.acm.org
As large language models (LLMs) take on complex tasks, their inputs are supplemented with
longer contexts that incorporate domain knowledge. Yet using long contexts is challenging …
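
The mechanism being summarized is shipping a precomputed KV cache over the network instead of recomputing it from the long context, which only pays off if the cache is compressed enough to stream quickly. CacheGen's actual codec exploits the KV cache's structure; the sketch below substitutes a generic per-chunk uint8 quantizer purely to show the compress-and-stream shape:

    import numpy as np

    def stream_kv_cache(kv: np.ndarray, chunk_tokens: int = 256):
        """Generic stand-in for KV-cache compression + streaming (not
        CacheGen's actual codec): quantize the float KV tensor to uint8
        chunk by chunk, yielding each chunk as soon as it is encoded so
        transmission can overlap with encoding."""
        for start in range(0, kv.shape[0], chunk_tokens):
            chunk = kv[start:start + chunk_tokens]
            lo = float(chunk.min())
            scale = (float(chunk.max()) - lo) / 255.0 or 1.0
            quantized = np.round((chunk - lo) / scale).astype(np.uint8)
            yield quantized.tobytes(), (lo, scale)  # metadata needed to dequantize

    # kv laid out as (num_tokens, num_layers * num_heads * head_dim); invented shape.
    for payload, meta in stream_kv_cache(np.random.randn(1024, 4096).astype(np.float32)):
        pass  # send payload + meta over the network here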

Queue management for SLO-oriented large language model serving

A Patke, D Reddy, S Jha, H Qiu, C Pinto… - Proceedings of the …, 2024 - dl.acm.org
Large language model (LLM) serving is becoming an increasingly critical workload for cloud
providers. Existing LLM serving systems focus on interactive requests, such as chatbots and …
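
The distinction the abstract draws is between interactive requests with tight latency SLOs and other workloads (e.g., batch jobs) with looser ones, which a plain first-come-first-served queue cannot tell apart. Below is a toy earliest-deadline-first queue as one way to make that distinction concrete; it is an assumption-level sketch, not the paper's actual policy:

    import heapq
    import time

    class SLOQueue:
        """Toy SLO-aware queue: orders requests by the deadline implied by
        their latency SLO, so loose-SLO batch jobs yield to interactive ones."""
        def __init__(self):
            self._heap = []

        def submit(self, req_id: str, slo_seconds: float) -> None:
            deadline = time.monotonic() + slo_seconds
            heapq.heappush(self._heap, (deadline, req_id))

        def next_request(self) -> str:
            _, req_id = heapq.heappop(self._heap)
            return req_id

    q = SLOQueue()
    q.submit("batch-42", slo_seconds=600.0)  # loose SLO
    q.submit("chat-7", slo_seconds=0.5)      # tight SLO
    print(q.next_request())                  # -> 'chat-7'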

Intelligent router for LLM workloads: Improving performance through workload-aware scheduling

K Jain, A Parayil, A Mallick, E Choukse, X Qin… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Model (LLM) workloads have distinct prefill and decode phases with
different compute and memory requirements, which should ideally be accounted for when …
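
The point is that a prefill-heavy request (long prompt, short output) and a decode-heavy one (short prompt, long output) stress a replica differently, so a round-robin router mixes them badly. A sketch of a workload-aware scoring router; the replica fields and the weights are illustrative assumptions, not the paper's model:

    def pick_replica(replicas: list[dict], prompt_tokens: int) -> dict:
        """Route to the replica where this request's prefill work contends
        least with what is already running. `pending_prefill_tokens` and
        `active_decodes` are assumed to be reported by each replica."""
        def contention(r: dict) -> float:
            prefill_load = r["pending_prefill_tokens"] + prompt_tokens
            decode_load = r["active_decodes"]
            return 0.7 * prefill_load + 0.3 * 100 * decode_load  # invented weights
        return min(replicas, key=contention)

    replicas = [
        {"name": "r0", "pending_prefill_tokens": 8000, "active_decodes": 4},
        {"name": "r1", "pending_prefill_tokens": 500, "active_decodes": 30},
    ]
    print(pick_replica(replicas, prompt_tokens=2000)["name"])  # -> 'r1'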

SocialMind: LLM-based Proactive AR Social Assistive System with Human-like Perception for In-situ Live Interactions

B Yang, Y Guo, L Xu, Z Yan, H Chen, G Xing… - arXiv preprint arXiv …, 2024 - arxiv.org
Social interactions are fundamental to human life. The recent emergence of large
language model (LLM)-based virtual assistants has demonstrated their potential to revolutionize …

Efficient LLM Scheduling by Learning to Rank

Y Fu, S Zhu, R Su, A Qiao, I Stoica, H Zhang - arXiv preprint arXiv …, 2024 - arxiv.org
In Large Language Model (LLM) inference, the output length of an LLM request is typically
regarded as not known a priori. Consequently, most LLM serving systems employ a simple …
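
Because output lengths are unknown, most servers fall back to first-come-first-served, which lets long requests block short ones. The paper's idea is that a model only needs to learn the relative order of output lengths, not their exact values, to approximate shortest-job-first. A sketch, where `predict_rank` stands in for such a learned ranker:

    def rank_schedule(waiting: list[dict], predict_rank) -> list[dict]:
        """Order waiting requests by a learned ranking score of their likely
        output length (lower score = likely shorter), approximating
        shortest-job-first without exact length predictions. `predict_rank`
        is an assumed model interface."""
        return sorted(waiting, key=lambda req: predict_rank(req["prompt"]))

    # Toy ranker: pretend prompts asking for lists run long. Illustrative only.
    toy_ranker = lambda prompt: 1.0 if "list" in prompt else 0.0
    queue = [{"prompt": "list every US state"}, {"prompt": "define SLO"}]
    print(rank_schedule(queue, toy_ranker)[0]["prompt"])  # -> 'define SLO'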

AdaServe: SLO-Customized LLM Serving with Fine-Grained Speculative Decoding

Z Li, Z Chen, R Delacourt, G Oliaro, Z Wang… - arXiv preprint arXiv …, 2025 - arxiv.org
This paper introduces AdaServe, the first LLM serving system to support SLO customization
through fine-grained speculative decoding. AdaServe leverages the logits of a draft model to …
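
The snippet says AdaServe reads the draft model's logits to control speculation at a fine grain; the intuition is that a confident draft is worth extending (more tokens verified per target-model step), while a hesitant draft is likely to be rejected at verification. A sketch of one such policy; the thresholds and interface are invented for illustration, not AdaServe's algorithm:

    import math

    def speculation_length(draft_logprobs: list[float], tight_slo: bool) -> int:
        """Decide how many draft tokens to submit for verification: extend
        the draft while its top-token probability stays high, and allow a
        longer budget when the request's per-token latency SLO is tight."""
        max_len = 8 if tight_slo else 3       # invented budgets
        length = 0
        for lp in draft_logprobs[:max_len]:
            if math.exp(lp) < 0.5:            # draft no longer confident; stop
                break
            length += 1
        return length

    print(speculation_length([-0.1, -0.2, -1.5, -0.1], tight_slo=True))  # -> 2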

TETRIS: Optimal Draft Token Selection for Batch Speculative Decoding

Z Wu, Z Zhou, A Verma, A Prakash, D Rus… - arXiv preprint arXiv …, 2025 - arxiv.org
We propose TETRIS, a novel method that optimizes the total throughput of batch speculative
decoding in multi-request settings. Unlike existing methods that optimize for a single request …
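
In batch speculative decoding the verifier's capacity is shared: every draft token submitted for verification spends part of a fixed batch budget, so tokens unlikely to be accepted waste it. A greedy sketch of budget-aware selection that keeps, across all requests, the prefixes with the highest cumulative draft probability; illustrative only, not the paper's exact algorithm:

    import heapq

    def select_draft_tokens(draft_probs: list[list[float]], budget: int) -> list[int]:
        """Given each request's draft-token probabilities in sequence order,
        choose how many leading tokens per request to submit under a shared
        budget. A token only helps if its prefix is also submitted, so
        candidates are extended one position at a time, ranked by cumulative
        probability (a proxy for whole-prefix acceptance)."""
        # max-heap via negated cumulative probability
        heap = [(-probs[0], i, 0) for i, probs in enumerate(draft_probs) if probs]
        heapq.heapify(heap)
        keep = [0] * len(draft_probs)
        while heap and budget > 0:
            neg_cum, i, pos = heapq.heappop(heap)
            keep[i] = pos + 1
            budget -= 1
            if pos + 1 < len(draft_probs[i]):
                heapq.heappush(heap, (neg_cum * draft_probs[i][pos + 1], i, pos + 1))
        return keep  # draft tokens to submit per request

    # Request 0 has a confident draft, request 1 does not; the budget favors request 0.
    print(select_draft_tokens([[0.9, 0.8, 0.7], [0.4, 0.3]], budget=3))  # -> [3, 0]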

Unveiling Environmental Impacts of Large Language Model Serving: A Functional Unit View

Y Wu, I Hua, Y Ding - arXiv preprint arXiv:2502.11256, 2025 - arxiv.org
Large language models (LLMs) offer powerful capabilities but come with significant
environmental costs, particularly in carbon emissions. Existing studies benchmark these …
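
A functional unit normalizes environmental cost by useful output, e.g., grams of CO2 per million generated tokens rather than per GPU-hour, which makes different hardware and serving configurations comparable. A back-of-the-envelope sketch of that conversion; all inputs are invented, and the paper's methodology is richer than this operational-energy-only estimate:

    def gco2_per_million_tokens(power_watts: float, tokens_per_sec: float,
                                grid_gco2_per_kwh: float) -> float:
        """Operational carbon per functional unit (one million generated
        tokens): energy per token times grid carbon intensity. Ignores
        embodied carbon and idle power."""
        kwh_per_token = (power_watts / 1000.0) / 3600.0 / tokens_per_sec
        return kwh_per_token * grid_gco2_per_kwh * 1_000_000

    # e.g. a 700 W accelerator sustaining 2500 tok/s on a 400 gCO2/kWh grid:
    print(round(gco2_per_million_tokens(700, 2500, 400), 1))  # -> 31.1 g per M tokens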