Large language monkeys: Scaling inference compute with repeated sampling

B Brown, J Juravsky, R Ehrlich, R Clark, QV Le… - arXiv preprint arXiv …, 2024 - arxiv.org
Scaling the amount of compute used to train language models has dramatically improved
their capabilities. However, when it comes to inference, we often limit the amount of compute …
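
The coverage metric this line of work measures (the chance that at least one of k samples solves a problem) is usually computed with the standard unbiased pass@k estimator; a minimal sketch:

```python
def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of the probability that at least one of k
    samples is correct, given n samples of which c were correct.
    Computes 1 - C(n-c, k) / C(n, k) as a stable running product."""
    if n - c < k:
        return 1.0  # every size-k subset contains a correct sample
    prob_all_fail = 1.0
    for i in range(k):
        prob_all_fail *= (n - c - i) / (n - i)
    return 1.0 - prob_all_fail

# Coverage grows steeply with the number of samples drawn:
for k in (1, 10, 100):
    print(k, round(pass_at_k(100, 3, k), 4))
```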

SGLang: Efficient execution of structured language model programs

L Zheng, L Yin, Z Xie, CL Sun… - Advances in …, 2025 - proceedings.neurips.cc
Large language models (LLMs) are increasingly used for complex tasks that require multiple
generation calls, advanced prompting techniques, control flow, and structured …
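
A minimal sketch of an SGLang program, based on the project's documented frontend primitives (sgl.function, sgl.gen); the local endpoint URL is a placeholder for a launched SGLang server, and the runtime reuses the shared prefix KV cache across the chained generation calls:

```python
import sglang as sgl

@sgl.function
def qa_with_summary(s, question):
    # Each += extends one program state; the two gen calls share the
    # prefix KV cache built by the earlier parts of the program.
    s += "Question: " + question + "\n"
    s += "Answer: " + sgl.gen("answer", max_tokens=64)
    s += "\nOne-line summary: " + sgl.gen("summary", max_tokens=32)

sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))
state = qa_with_summary.run(question="Why batch LLM calls?")
print(state["answer"], state["summary"])
```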

Efficiently Programming Large Language Models using SGLang

L Zheng, L Yin, Z Xie, J Huang, C Sun, CH Yu, S Cao… - 2023 - par.nsf.gov
Large language models (LLMs) are increasingly used for complex tasks that require multiple
generation calls, advanced prompting techniques, control flow, and structured …

InstInfer: In-storage attention offloading for cost-effective long-context LLM inference

X Pan, E Li, Q Li, S Liang, Y Shan, K Zhou… - arXiv preprint arXiv …, 2024 - arxiv.org
The widespread adoption of Large Language Models (LLMs) marks a significant milestone in
generative AI. Nevertheless, the increasing context length and batch size in offline LLM …
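
The generic offload-and-stream pattern that motivates in-storage designs can be sketched in a few lines. This toy (not InstInfer's system, which pushes the attention computation itself down into the storage device) spills KV blocks to files and accumulates attention with a streaming softmax so only one block is resident at a time:

```python
import numpy as np, tempfile, os

D = 128                      # head dimension
store = tempfile.mkdtemp()   # stand-in for an SSD-resident KV pool

def spill(block_id: int, k: np.ndarray, v: np.ndarray) -> None:
    np.savez(os.path.join(store, f"{block_id}.npz"), k=k, v=v)

def attend(q: np.ndarray, block_ids: list[int]) -> np.ndarray:
    """Stream offloaded KV blocks back one at a time, keeping a
    running (online) softmax so peak memory is a single block."""
    m, denom, acc = -np.inf, 0.0, np.zeros(D)
    for bid in block_ids:
        blk = np.load(os.path.join(store, f"{bid}.npz"))
        scores = blk["k"] @ q / np.sqrt(D)
        new_m = max(m, scores.max())
        scale = np.exp(m - new_m) if np.isfinite(m) else 0.0
        w = np.exp(scores - new_m)
        denom = denom * scale + w.sum()
        acc = acc * scale + w @ blk["v"]
        m = new_m
    return acc / denom

for bid in range(4):  # spill four blocks of 64 tokens each
    spill(bid, np.random.randn(64, D), np.random.randn(64, D))
print(attend(np.random.randn(D), [0, 1, 2, 3]).shape)  # (128,)
```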

New solutions on LLM acceleration, optimization, and application

Y Huang, LJ Wan, H Ye, M Jha, J Wang, Y Li… - Proceedings of the 61st …, 2024 - dl.acm.org
Large Language Models (LLMs) have revolutionized a wide range of applications with their
strong human-like understanding and creativity. Due to the continuously growing model size …

ShadowKV: KV cache in shadows for high-throughput long-context LLM inference

H Sun, LW Chang, W Bao, S Zheng, N Zheng… - arXiv preprint arXiv …, 2024 - arxiv.org
With the widespread deployment of long-context large language models (LLMs), there has
been a growing demand for efficient support of high-throughput inference. However, as the …
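
ShadowKV's motivating observation is that pre-RoPE key caches are close to low-rank, so a factorized key cache can stay in GPU memory while the full values are offloaded. A toy numpy illustration of the storage saving from a rank-r factorization (random data is nearly full-rank, so the reconstruction error here is pessimistic compared with real key caches):

```python
import numpy as np

T, D, r = 4096, 128, 16                 # tokens, head dim, kept rank
K = np.random.randn(T, D)               # stand-in for one head's keys

U, S, Vt = np.linalg.svd(K, full_matrices=False)
A = U[:, :r] * S[:r]                    # (T, r) factor, kept in fast memory
B = Vt[:r]                              # (r, D) factor, kept in fast memory

K_approx = A @ B                        # reconstruct keys on demand
ratio = (A.size + B.size) / K.size
err = np.linalg.norm(K - K_approx) / np.linalg.norm(K)
print(f"memory ratio {ratio:.2f}, relative error {err:.2f}")
```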

BatchLLM: Optimizing large batched LLM inference with global prefix sharing and throughput-oriented token batching

Z Zheng, X Ji, T Fang, F Zhou, C Liu, G Peng - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) increasingly play an important role in a wide range of
information processing and management tasks. Many of these tasks are performed in large …
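
The "global prefix sharing" idea is easy to sketch: group a batch of requests by their shared prefix up front so each distinct prefix is prefilled once for the whole group (hypothetical helper names, not BatchLLM's API):

```python
from collections import defaultdict

def group_by_prefix(requests: list[tuple[str, str]]) -> dict[str, list[str]]:
    """requests: (shared_prefix, unique_suffix) pairs."""
    groups: dict[str, list[str]] = defaultdict(list)
    for prefix, suffix in requests:
        groups[prefix].append(suffix)
    return groups

requests = [
    ("Summarize the following review:\n", "Great battery life."),
    ("Summarize the following review:\n", "Screen cracked in a week."),
    ("Translate to French:\n", "Good morning."),
]
for prefix, suffixes in group_by_prefix(requests).items():
    # One shared prefill per distinct prefix, reused by every suffix.
    print(f"prefill_once({prefix!r}) reused by {len(suffixes)} suffixes")
```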

Optimizing LLM queries in relational workloads

S Liu, A Biswal, A Cheng, X Mo, S Cao… - arXiv preprint arXiv …, 2024 - arxiv.org
Analytical database providers (e.g., Redshift, Databricks, BigQuery) have rapidly added
support for invoking Large Language Models (LLMs) through native user-defined functions …
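
One optimization from this line of work is simple to illustrate: deduplicate identical LLM-UDF inputs so each distinct column value triggers one model call (call_llm is a hypothetical stand-in for the native UDF):

```python
def call_llm(text: str) -> str:
    return f"<sentiment of {text!r}>"     # placeholder model call

def llm_udf_column(values: list[str]) -> list[str]:
    """Apply the LLM UDF over a column with memoization, so repeated
    values cost a single model invocation."""
    cache: dict[str, str] = {}
    out = []
    for v in values:
        if v not in cache:                # one call per distinct value
            cache[v] = call_llm(v)
        out.append(cache[v])
    return out

rows = ["great", "terrible", "great", "great"]
print(llm_udf_column(rows))               # 2 model calls for 4 rows
```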

Compute or load KV cache? Why not both?

S Jin, X Liu, Q Zhang, ZM Mao - arXiv preprint arXiv:2410.03065, 2024 - arxiv.org
Recent advancements in Large Language Models (LLMs) have significantly increased
context window sizes, enabling sophisticated applications but also introducing substantial …
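
The trade-off in the title fits a back-of-the-envelope model: recomputing prefill costs FLOPs, loading a saved KV cache costs I/O bytes, and "both" means splitting the context so the recomputed share and the loaded share finish together (illustrative constants for a ~70B model over a ~2 GB/s link, not the paper's measurements):

```python
def prefill_time(tokens, flops_per_token=1.4e11, gpu_flops=2e14):
    return tokens * flops_per_token / gpu_flops   # seconds to recompute

def load_time(tokens, kv_bytes_per_token=3.3e5, bw_bytes=2e9):
    return tokens * kv_bytes_per_token / bw_bytes  # seconds to load KV

tokens = 32_000
t_c, t_l = prefill_time(tokens), load_time(tokens)
frac = t_l / (t_c + t_l)          # recompute share so both finish together
overlap = max(prefill_time(frac * tokens), load_time((1 - frac) * tokens))
print(f"compute {t_c:.1f}s  load {t_l:.1f}s  overlapped {overlap:.1f}s")
```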

SCBench: A KV cache-centric analysis of long-context methods

Y Li, H Jiang, Q Wu, X Luo, S Ahn, C Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Long-context LLMs have enabled numerous downstream applications but also introduced
significant challenges related to computational and memory efficiency. To address these …