Efficiently Programming Large Language Models using SGLang

L Zheng, L Yin, Z Xie, J Huang, C Sun, CH Yu, S Cao… - 2023 - par.nsf.gov
Large language models (LLMs) are increasingly used for complex tasks that require multiple
generation calls, advanced prompting techniques, control flow, and structured …

SGLang: Efficient execution of structured language model programs

L Zheng, L Yin, Z Xie, C Sun, J Huang… - arXiv preprint arXiv …, 2024 - minjiazhang.github.io
Large language models (LLMs) are increasingly used for complex tasks that require multiple
generation calls, advanced prompting techniques, control flow, and structured …
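
Both entries describe the same system: a frontend language in which one program interleaves ordinary Python control flow with multiple generation calls. The sketch below is written against SGLang's public Python frontend to illustrate that style; the endpoint address, prompt text, and token limits are illustrative, and the exact API surface may differ across versions.

```python
import sglang as sgl

@sgl.function
def branched_qa(s, question):
    # A single program issues multiple generation calls
    # and branches on intermediate output with plain Python.
    s += "Question: " + question + "\n"
    s += "Answer: " + sgl.gen("answer", max_tokens=64, stop="\n")
    if "unsure" in s["answer"]:
        s += "Let's think step by step: " + sgl.gen("cot", max_tokens=128)

# Point the program at a running SGLang server (address illustrative).
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))
state = branched_qa.run(question="What is the capital of France?")
print(state["answer"])
```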

NanoFlow: Towards optimal large language model serving throughput

K Zhu, Y Zhao, L Zhao, G Zuo, Y Gu, D Xie… - arXiv preprint arXiv …, 2024 - arxiv.org
The increasing usage of Large Language Models (LLMs) has resulted in a surging demand
for planet-scale serving systems, where tens of thousands of GPUs continuously serve …

Mnemosyne: Parallelization strategies for efficiently serving multi-million context length LLM inference requests without approximations

A Agrawal, J Chen, Í Goiri, R Ramjee, C Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
As large language models (LLMs) evolve to handle increasingly longer contexts, serving
inference requests for context lengths in the range of millions of tokens presents unique …

Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve

A Agrawal, N Kedia, A Panwar, J Mohan… - … USENIX Symposium on …, 2024 - usenix.org
Each LLM serving request goes through two phases. The first is prefill, which processes the
entire input prompt and produces the first output token; the second is decode, which …
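
The truncated snippet introduces the two phases whose tradeoff Sarathi-Serve targets. The scheduler sketch below shows the chunked-prefill idea in spirit: prefills are split into chunks and coalesced with ongoing decodes under a per-iteration token budget, so a long prompt never stalls decode latency. The budget value and data structures here are assumptions for illustration, not the paper's implementation.

```python
from collections import deque
from dataclasses import dataclass

TOKEN_BUDGET = 512  # max tokens per scheduler iteration (illustrative)

@dataclass
class Request:
    rid: int
    prompt_len: int   # total prompt tokens to prefill
    prefilled: int = 0

def build_batch(running, waiting):
    """One iteration of chunked prefill + stall-free batching:
    decode-phase requests are admitted first (one token each), then
    the leftover budget is spent on a chunk of a pending prefill."""
    budget = TOKEN_BUDGET
    batch = []
    for req in running:            # decodes: 1 token each
        if budget == 0:
            break
        batch.append((req, 1))
        budget -= 1
    while budget > 0 and waiting:  # chunked prefill fills the rest
        req = waiting[0]
        chunk = min(budget, req.prompt_len - req.prefilled)
        batch.append((req, chunk))
        req.prefilled += chunk
        budget -= chunk
        if req.prefilled == req.prompt_len:
            running.append(waiting.popleft())  # enters decode phase
    return batch

# Example: a 4096-token prompt is prefetched in chunks alongside two decodes.
running = [Request(0, 16, 16), Request(1, 16, 16)]
waiting = deque([Request(2, 4096)])
print(build_batch(running, waiting))  # two 1-token decodes + a 510-token chunk
```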

POD-Attention: Unlocking full prefill-decode overlap for faster LLM inference

AK Kamath, R Prabhu, J Mohan, S Peter… - arXiv preprint arXiv …, 2024 - arxiv.org
Each request in LLM inference goes through two phases: compute-bound prefill and
memory-bandwidth-bound decode. To improve GPU utilization, recent systems use hybrid …
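
The snippet's claim that prefill is compute-bound while decode is memory-bandwidth-bound can be checked with a back-of-the-envelope arithmetic-intensity estimate. The sketch below ignores softmax, query reads, and output writes, so it is a rough model rather than a precise roofline.

```python
def attention_intensity(q_tokens: int, kv_tokens: int, d: int = 128) -> float:
    """Rough FLOPs-per-byte of one attention head.
    FLOPs: QK^T plus score*V, about 4 * q * kv * d.
    Bytes: reading K and V in fp16, about 4 * kv * d."""
    flops = 4 * q_tokens * kv_tokens * d
    bytes_moved = 4 * kv_tokens * d
    return flops / bytes_moved  # simplifies to q_tokens

# Prefill: thousands of query tokens reuse each K/V read -> compute-bound.
print(attention_intensity(q_tokens=2048, kv_tokens=2048))  # 2048.0
# Decode: one query token per K/V read -> memory-bandwidth-bound.
print(attention_intensity(q_tokens=1, kv_tokens=2048))     # 1.0
```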

AdaServe: SLO-Customized LLM Serving with Fine-Grained Speculative Decoding

Z Li, Z Chen, R Delacourt, G Oliaro, Z Wang… - arXiv preprint arXiv …, 2025 - arxiv.org
This paper introduces AdaServe, the first LLM serving system to support SLO customization
through fine-grained speculative decoding. AdaServe leverages the logits of a draft model to …
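
The snippet says AdaServe builds on the logits of a draft model; the sketch below shows the generic speculative-decoding verification step those logits feed into (accept a drafted token with probability min(1, p_target/p_draft), resample from the residual on rejection). This is the standard scheme for orientation only, not AdaServe's fine-grained, SLO-aware variant.

```python
import numpy as np

rng = np.random.default_rng(0)

def verify_draft(draft_probs, target_probs, drafted):
    """Generic speculative-decoding verification.
    draft_probs, target_probs: arrays of shape (k, vocab), per-position
    distributions from the draft and target models.
    drafted: k token ids proposed by the draft model."""
    accepted = []
    for i, tok in enumerate(drafted):
        p, q = target_probs[i][tok], draft_probs[i][tok]
        if rng.random() < min(1.0, p / q):
            accepted.append(tok)               # draft token accepted
        else:
            # Rejected: resample from the residual max(0, p - q),
            # renormalized, and stop speculating at this position.
            residual = np.maximum(target_probs[i] - draft_probs[i], 0)
            accepted.append(int(rng.choice(len(residual), p=residual / residual.sum())))
            break
    return accepted
```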

MARM: Unlocking the Future of Recommendation Systems through Memory Augmentation and Scalable Complexity

X Lv, J Cao, S Guan, X Zhou, Z Qi, Y Zang, M Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Scaling laws have guided language model design in past years; however, the scaling laws of
NLP cannot be directly applied to RecSys due to the following …

ArkVale: Efficient Generative LLM Inference with Recallable Key-Value Eviction

R Chen, Z Wang, B Cao, T Wu, S Zheng, X Li… - The Thirty-eighth Annual … - openreview.net
Large Language Models (LLMs) are widely used in today's natural language processing
tasks. To support applications like multi-turn chats, document understanding, and …
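
The title names the mechanism: evicted KV entries stay recallable rather than being permanently dropped. The toy cache below illustrates that one idea, keeping a cheap per-page digest on device so evicted pages can be fetched back when a new query scores them highly; the digest construction, eviction policy, and page layout here are assumptions, not ArkVale's actual design.

```python
import numpy as np

class RecallableKVCache:
    """Toy page cache: cold KV pages are spilled to host memory but stay
    recallable via small per-page digests kept on device."""

    def __init__(self, budget_pages):
        self.budget = budget_pages
        self.gpu = {}        # page_id -> (keys, values) on device
        self.cpu = {}        # page_id -> (keys, values) spilled to host
        self.digests = {}    # page_id -> mean key vector (cheap summary)

    def add_page(self, pid, keys, values):
        self.digests[pid] = keys.mean(axis=0)
        self.gpu[pid] = (keys, values)
        if len(self.gpu) > self.budget:
            victim = next(iter(self.gpu))     # oldest page (placeholder policy)
            self.cpu[victim] = self.gpu.pop(victim)

    def pages_for_query(self, query, top_k=4):
        # Score every digest against the query and recall any evicted
        # page that looks relevant instead of losing it forever.
        scores = {pid: float(query @ d) for pid, d in self.digests.items()}
        chosen = sorted(scores, key=scores.get, reverse=True)[:top_k]
        for pid in chosen:
            if pid in self.cpu:
                self.gpu[pid] = self.cpu.pop(pid)   # recall from host
        return [self.gpu[pid] for pid in chosen]
```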

DeServe: Towards Affordable Offline LLM Inference via Decentralization

L Wu, X Liu, T Shi, Z Ye, D Song - arXiv preprint arXiv:2501.14784, 2025 - arxiv.org
The rapid growth of generative AI and its integration into everyday workflows have
significantly increased the demand for large language model (LLM) inference services …