Efficiently Programming Large Language Models using SGLang

L Zheng, L Yin, Z Xie, J Huang, C Sun, CH Yu, S Cao… - 2023 - par.nsf.gov
Large language models (LLMs) are increasingly used for complex tasks that require multiple
generation calls, advanced prompting techniques, control flow, and structured …

SGLang: Efficient execution of structured language model programs

L Zheng, L Yin, Z Xie, C Sun, J Huang… - arXiv preprint arXiv …, 2024 - minjiazhang.github.io
Large language models (LLMs) are increasingly used for complex tasks that require multiple
generation calls, advanced prompting techniques, control flow, and structured …
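
Both entries describe the same system: a frontend language in which one program interleaves ordinary Python control flow with multiple generation calls. The sketch below is written against SGLang's public Python frontend to illustrate that style; the endpoint address, prompt text, and token limits are illustrative, and the exact API surface may differ across versions.

```python
import sglang as sgl

@sgl.function
def branched_qa(s, question):
    # A single program issues multiple generation calls
    # and branches on intermediate output with plain Python.
    s += "Question: " + question + "\n"
    s += "Answer: " + sgl.gen("answer", max_tokens=64, stop="\n")
    if "unsure" in s["answer"]:
        s += "Let's think step by step: " + sgl.gen("cot", max_tokens=128)

# Point the program at a running SGLang server (address illustrative).
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))
state = branched_qa.run(question="What is the capital of France?")
print(state["answer"])
```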

NanoFlow: Towards optimal large language model serving throughput

K Zhu, Y Zhao, L Zhao, G Zuo, Y Gu, D Xie… - arXiv preprint arXiv …, 2024 - arxiv.org
The increasing usage of Large Language Models (LLMs) has resulted in a surging demand
for planet-scale serving systems, where tens of thousands of GPUs continuously serve …

Mnemosyne: Parallelization strategies for efficiently serving multi-million context length LLM inference requests without approximations

A Agrawal, J Chen, Í Goiri, R Ramjee, C Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
As large language models (LLMs) evolve to handle increasingly longer contexts, serving
inference requests for context lengths in the range of millions of tokens presents unique …

Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve

A Agrawal, N Kedia, A Panwar, J Mohan… - … USENIX Symposium on …, 2024 - usenix.org
Each LLM serving request goes through two phases. The first is prefill, which processes the
entire input prompt and produces the first output token; the second is decode, which …
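
The truncated snippet introduces the two phases whose tradeoff Sarathi-Serve targets. The scheduler sketch below shows the chunked-prefill idea in spirit: prefills are split into chunks and coalesced with ongoing decodes under a per-iteration token budget, so a long prompt never stalls decode latency. The budget value and data structures here are assumptions for illustration, not the paper's implementation.

```python
from collections import deque
from dataclasses import dataclass

TOKEN_BUDGET = 512  # max tokens per scheduler iteration (illustrative)

@dataclass
class Request:
    rid: int
    prompt_len: int   # total prompt tokens to prefill
    prefilled: int = 0

def build_batch(running, waiting):
    """One iteration of chunked prefill + stall-free batching:
    decode-phase requests are admitted first (one token each), then
    the leftover budget is spent on a chunk of a pending prefill."""
    budget = TOKEN_BUDGET
    batch = []
    for req in running:            # decodes: 1 token each
        if budget == 0:
            break
        batch.append((req, 1))
        budget -= 1
    while budget > 0 and waiting:  # chunked prefill fills the rest
        req = waiting[0]
        chunk = min(budget, req.prompt_len - req.prefilled)
        batch.append((req, chunk))
        req.prefilled += chunk
        budget -= chunk
        if req.prefilled == req.prompt_len:
            running.append(waiting.popleft())  # enters decode phase
    return batch

# Example: a 4096-token prompt is prefetched in chunks alongside two decodes.
running = [Request(0, 16, 16), Request(1, 16, 16)]
waiting = deque([Request(2, 4096)])
print(build_batch(running, waiting))  # two 1-token decodes + a 510-token chunk
```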

POD-Attention: Unlocking full prefill-decode overlap for faster LLM inference

AK Kamath, R Prabhu, J Mohan, S Peter… - arXiv preprint arXiv …, 2024 - arxiv.org
Each request in LLM inference goes through two phases: compute-bound prefill and
memory-bandwidth-bound decode. To improve GPU utilization, recent systems use hybrid …
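
The snippet's claim that prefill is compute-bound while decode is memory-bandwidth-bound can be checked with a back-of-the-envelope arithmetic-intensity estimate. The sketch below ignores softmax, query reads, and output writes, so it is a rough model rather than a precise roofline.

```python
def attention_intensity(q_tokens: int, kv_tokens: int, d: int = 128) -> float:
    """Rough FLOPs-per-byte of one attention head.
    FLOPs: QK^T plus score*V, about 4 * q * kv * d.
    Bytes: reading K and V in fp16, about 4 * kv * d."""
    flops = 4 * q_tokens * kv_tokens * d
    bytes_moved = 4 * kv_tokens * d
    return flops / bytes_moved  # simplifies to q_tokens

# Prefill: thousands of query tokens reuse each K/V read -> compute-bound.
print(attention_intensity(q_tokens=2048, kv_tokens=2048))  # 2048.0
# Decode: one query token per K/V read -> memory-bandwidth-bound.
print(attention_intensity(q_tokens=1, kv_tokens=2048))     # 1.0
```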

AdaServe: SLO-Customized LLM Serving with Fine-Grained Speculative Decoding

Z Li, Z Chen, R Delacourt, G Oliaro, Z Wang… - arXiv preprint arXiv …, 2025 - arxiv.org
This paper introduces AdaServe, the first LLM serving system to support SLO customization
through fine-grained speculative decoding. AdaServe leverages the logits of a draft model to …
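
The snippet says AdaServe builds on the logits of a draft model; the sketch below shows the generic speculative-decoding verification step those logits feed into (accept a drafted token with probability min(1, p_target/p_draft), resample from the residual on rejection). This is the standard scheme for orientation only, not AdaServe's fine-grained, SLO-aware variant.

```python
import numpy as np

rng = np.random.default_rng(0)

def verify_draft(draft_probs, target_probs, drafted):
    """Generic speculative-decoding verification.
    draft_probs, target_probs: arrays of shape (k, vocab), per-position
    distributions from the draft and target models.
    drafted: k token ids proposed by the draft model."""
    accepted = []
    for i, tok in enumerate(drafted):
        p, q = target_probs[i][tok], draft_probs[i][tok]
        if rng.random() < min(1.0, p / q):
            accepted.append(tok)               # draft token accepted
        else:
            # Rejected: resample from the residual max(0, p - q),
            # renormalized, and stop speculating at this position.
            residual = np.maximum(target_probs[i] - draft_probs[i], 0)
            accepted.append(int(rng.choice(len(residual), p=residual / residual.sum())))
            break
    return accepted
```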

MARM: Unlocking the Future of Recommendation Systems through Memory Augmentation and Scalable Complexity

X Lv, J Cao, S Guan, X Zhou, Z Qi, Y Zang, M Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Scaling laws have guided language model design in past years; however, the scaling laws of
NLP cannot be directly applied to RecSys due to the following …

ArkVale: Efficient Generative LLM Inference with Recallable Key-Value Eviction

R Chen, Z Wang, B Cao, T Wu, S Zheng, X Li… - The Thirty-eighth Annual … - openreview.net
Large Language Models (LLMs) are widely used in today's natural language processing
tasks. To support applications like multi-turn chats, document understanding, and …
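
The title names the mechanism: evicted KV entries stay recallable rather than being permanently dropped. The toy cache below illustrates that one idea, keeping a cheap per-page digest on device so evicted pages can be fetched back when a new query scores them highly; the digest construction, eviction policy, and page layout here are assumptions, not ArkVale's actual design.

```python
import numpy as np

class RecallableKVCache:
    """Toy page cache: cold KV pages are spilled to host memory but stay
    recallable via small per-page digests kept on device."""

    def __init__(self, budget_pages):
        self.budget = budget_pages
        self.gpu = {}        # page_id -> (keys, values) on device
        self.cpu = {}        # page_id -> (keys, values) spilled to host
        self.digests = {}    # page_id -> mean key vector (cheap summary)

    def add_page(self, pid, keys, values):
        self.digests[pid] = keys.mean(axis=0)
        self.gpu[pid] = (keys, values)
        if len(self.gpu) > self.budget:
            victim = next(iter(self.gpu))     # oldest page (placeholder policy)
            self.cpu[victim] = self.gpu.pop(victim)

    def pages_for_query(self, query, top_k=4):
        # Score every digest against the query and recall any evicted
        # page that looks relevant instead of losing it forever.
        scores = {pid: float(query @ d) for pid, d in self.digests.items()}
        chosen = sorted(scores, key=scores.get, reverse=True)[:top_k]
        for pid in chosen:
            if pid in self.cpu:
                self.gpu[pid] = self.cpu.pop(pid)   # recall from host
        return [self.gpu[pid] for pid in chosen]
```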

DeServe: Towards Affordable Offline LLM Inference via Decentralization

L Wu, X Liu, T Shi, Z Ye, D Song - arXiv preprint arXiv:2501.14784, 2025 - arxiv.org
The rapid growth of generative AI and its integration into everyday workflows have
significantly increased the demand for large language model (LLM) inference services …