Large language monkeys: Scaling inference compute with repeated sampling

B Brown, J Juravsky, R Ehrlich, R Clark, QV Le… - arXiv preprint arXiv …, 2024 - arxiv.org
Scaling the amount of compute used to train language models has dramatically improved
their capabilities. However, when it comes to inference, we often limit the amount of compute …
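
The coverage metric this line of work measures (the chance that at least one of k samples solves a problem) is usually computed with the standard unbiased pass@k estimator; a minimal sketch:

```python
def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of the probability that at least one of k
    samples is correct, given n samples of which c were correct.
    Computes 1 - C(n-c, k) / C(n, k) as a stable running product."""
    if n - c < k:
        return 1.0  # every size-k subset contains a correct sample
    prob_all_fail = 1.0
    for i in range(k):
        prob_all_fail *= (n - c - i) / (n - i)
    return 1.0 - prob_all_fail

# Coverage grows steeply with the number of samples drawn:
for k in (1, 10, 100):
    print(k, round(pass_at_k(100, 3, k), 4))
```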

SGLang: Efficient execution of structured language model programs

L Zheng, L Yin, Z Xie, CL Sun… - Advances in …, 2025 - proceedings.neurips.cc
Large language models (LLMs) are increasingly used for complex tasks that require multiple
generation calls, advanced prompting techniques, control flow, and structured …
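
A minimal sketch of an SGLang program, based on the project's documented frontend primitives (sgl.function, sgl.gen); the local endpoint URL is a placeholder for a launched SGLang server, and the runtime reuses the shared prefix KV cache across the chained generation calls:

```python
import sglang as sgl

@sgl.function
def qa_with_summary(s, question):
    # Each += extends one program state; the two gen calls share the
    # prefix KV cache built by the earlier parts of the program.
    s += "Question: " + question + "\n"
    s += "Answer: " + sgl.gen("answer", max_tokens=64)
    s += "\nOne-line summary: " + sgl.gen("summary", max_tokens=32)

sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))
state = qa_with_summary.run(question="Why batch LLM calls?")
print(state["answer"], state["summary"])
```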

Efficiently Programming Large Language Models using SGLang

L Zheng, L Yin, Z Xie, J Huang, C Sun, CH Yu, S Cao… - 2023 - par.nsf.gov
Large language models (LLMs) are increasingly used for complex tasks that require multiple
generation calls, advanced prompting techniques, control flow, and structured …

InstInfer: In-storage attention offloading for cost-effective long-context LLM inference

X Pan, E Li, Q Li, S Liang, Y Shan, K Zhou… - arXiv preprint arXiv …, 2024 - arxiv.org
The widespread adoption of Large Language Models (LLMs) marks a significant milestone in
generative AI. Nevertheless, the increasing context length and batch size in offline LLM …
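
The generic offload-and-stream pattern that motivates in-storage designs can be sketched in a few lines. This toy (not InstInfer's system, which pushes the attention computation itself down into the storage device) spills KV blocks to files and accumulates attention with a streaming softmax so only one block is resident at a time:

```python
import numpy as np, tempfile, os

D = 128                      # head dimension
store = tempfile.mkdtemp()   # stand-in for an SSD-resident KV pool

def spill(block_id: int, k: np.ndarray, v: np.ndarray) -> None:
    np.savez(os.path.join(store, f"{block_id}.npz"), k=k, v=v)

def attend(q: np.ndarray, block_ids: list[int]) -> np.ndarray:
    """Stream offloaded KV blocks back one at a time, keeping a
    running (online) softmax so peak memory is a single block."""
    m, denom, acc = -np.inf, 0.0, np.zeros(D)
    for bid in block_ids:
        blk = np.load(os.path.join(store, f"{bid}.npz"))
        scores = blk["k"] @ q / np.sqrt(D)
        new_m = max(m, scores.max())
        scale = np.exp(m - new_m) if np.isfinite(m) else 0.0
        w = np.exp(scores - new_m)
        denom = denom * scale + w.sum()
        acc = acc * scale + w @ blk["v"]
        m = new_m
    return acc / denom

for bid in range(4):  # spill four blocks of 64 tokens each
    spill(bid, np.random.randn(64, D), np.random.randn(64, D))
print(attend(np.random.randn(D), [0, 1, 2, 3]).shape)  # (128,)
```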

New solutions on LLM acceleration, optimization, and application

Y Huang, LJ Wan, H Ye, M Jha, J Wang, Y Li… - Proceedings of the 61st …, 2024 - dl.acm.org
Large Language Models (LLMs) have revolutionized a wide range of applications with their
strong human-like understanding and creativity. Due to the continuously growing model size …

ShadowKV: KV cache in shadows for high-throughput long-context LLM inference

H Sun, LW Chang, W Bao, S Zheng, N Zheng… - arXiv preprint arXiv …, 2024 - arxiv.org
With the widespread deployment of long-context large language models (LLMs), there has
been a growing demand for efficient support of high-throughput inference. However, as the …
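
ShadowKV's motivating observation is that pre-RoPE key caches are close to low-rank, so a factorized key cache can stay in GPU memory while the full values are offloaded. A toy numpy illustration of the storage saving from a rank-r factorization (random data is nearly full-rank, so the reconstruction error here is pessimistic compared with real key caches):

```python
import numpy as np

T, D, r = 4096, 128, 16                 # tokens, head dim, kept rank
K = np.random.randn(T, D)               # stand-in for one head's keys

U, S, Vt = np.linalg.svd(K, full_matrices=False)
A = U[:, :r] * S[:r]                    # (T, r) factor, kept in fast memory
B = Vt[:r]                              # (r, D) factor, kept in fast memory

K_approx = A @ B                        # reconstruct keys on demand
ratio = (A.size + B.size) / K.size
err = np.linalg.norm(K - K_approx) / np.linalg.norm(K)
print(f"memory ratio {ratio:.2f}, relative error {err:.2f}")
```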

BatchLLM: Optimizing large batched LLM inference with global prefix sharing and throughput-oriented token batching

Z Zheng, X Ji, T Fang, F Zhou, C Liu, G Peng - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) increasingly play an important role in a wide range of
information processing and management tasks. Many of these tasks are performed in large …
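
The "global prefix sharing" idea is easy to sketch: group a batch of requests by their shared prefix up front so each distinct prefix is prefilled once for the whole group (hypothetical helper names, not BatchLLM's API):

```python
from collections import defaultdict

def group_by_prefix(requests: list[tuple[str, str]]) -> dict[str, list[str]]:
    """requests: (shared_prefix, unique_suffix) pairs."""
    groups: dict[str, list[str]] = defaultdict(list)
    for prefix, suffix in requests:
        groups[prefix].append(suffix)
    return groups

requests = [
    ("Summarize the following review:\n", "Great battery life."),
    ("Summarize the following review:\n", "Screen cracked in a week."),
    ("Translate to French:\n", "Good morning."),
]
for prefix, suffixes in group_by_prefix(requests).items():
    # One shared prefill per distinct prefix, reused by every suffix.
    print(f"prefill_once({prefix!r}) reused by {len(suffixes)} suffixes")
```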

Optimizing LLM queries in relational workloads

S Liu, A Biswal, A Cheng, X Mo, S Cao… - arXiv preprint arXiv …, 2024 - arxiv.org
Analytical database providers (e.g., Redshift, Databricks, BigQuery) have rapidly added
support for invoking Large Language Models (LLMs) through native user-defined functions …
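
One optimization from this line of work is simple to illustrate: deduplicate identical LLM-UDF inputs so each distinct column value triggers one model call (call_llm is a hypothetical stand-in for the native UDF):

```python
def call_llm(text: str) -> str:
    return f"<sentiment of {text!r}>"     # placeholder model call

def llm_udf_column(values: list[str]) -> list[str]:
    """Apply the LLM UDF over a column with memoization, so repeated
    values cost a single model invocation."""
    cache: dict[str, str] = {}
    out = []
    for v in values:
        if v not in cache:                # one call per distinct value
            cache[v] = call_llm(v)
        out.append(cache[v])
    return out

rows = ["great", "terrible", "great", "great"]
print(llm_udf_column(rows))               # 2 model calls for 4 rows
```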

Compute or load KV cache? Why not both?

S Jin, X Liu, Q Zhang, ZM Mao - arXiv preprint arXiv:2410.03065, 2024 - arxiv.org
Recent advancements in Large Language Models (LLMs) have significantly increased
context window sizes, enabling sophisticated applications but also introducing substantial …
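
The trade-off in the title fits a back-of-the-envelope model: recomputing prefill costs FLOPs, loading a saved KV cache costs I/O bytes, and "both" means splitting the context so the recomputed share and the loaded share finish together (illustrative constants for a ~70B model over a ~2 GB/s link, not the paper's measurements):

```python
def prefill_time(tokens, flops_per_token=1.4e11, gpu_flops=2e14):
    return tokens * flops_per_token / gpu_flops   # seconds to recompute

def load_time(tokens, kv_bytes_per_token=3.3e5, bw_bytes=2e9):
    return tokens * kv_bytes_per_token / bw_bytes  # seconds to load KV

tokens = 32_000
t_c, t_l = prefill_time(tokens), load_time(tokens)
frac = t_l / (t_c + t_l)          # recompute share so both finish together
overlap = max(prefill_time(frac * tokens), load_time((1 - frac) * tokens))
print(f"compute {t_c:.1f}s  load {t_l:.1f}s  overlapped {overlap:.1f}s")
```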

SCBench: A KV cache-centric analysis of long-context methods

Y Li, H Jiang, Q Wu, X Luo, S Ahn, C Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Long-context LLMs have enabled numerous downstream applications but also introduced
significant challenges related to computational and memory efficiency. To address these …