Large language monkeys: Scaling inference compute with repeated sampling
Scaling the amount of compute used to train language models has dramatically improved
their capabilities. However, when it comes to inference, we often limit the amount of compute …
SGLang: Efficient execution of structured language model programs
Large language models (LLMs) are increasingly used for complex tasks that require multiple
generation calls, advanced prompting techniques, control flow, and structured …
Efficiently Programming Large Language Models using SGLang
Large language models (LLMs) are increasingly used for complex tasks that require multiple
generation calls, advanced prompting techniques, control flow, and structured …
InstInfer: In-storage attention offloading for cost-effective long-context LLM inference
The widespread adoption of Large Language Models (LLMs) marks a significant milestone in
generative AI. Nevertheless, the increasing context length and batch size in offline LLM …
New solutions on LLM acceleration, optimization, and application
Large Language Models (LLMs) have revolutionized a wide range of applications with their
strong human-like understanding and creativity. Due to the continuously growing model size …
ShadowKV: KV cache in shadows for high-throughput long-context LLM inference
With the widespread deployment of long-context large language models (LLMs), there has
been a growing demand for efficient support of high-throughput inference. However, as the …
BatchLLM: Optimizing large batched LLM inference with global prefix sharing and throughput-oriented token batching
Large language models (LLMs) increasingly play an important role in a wide range of
information processing and management tasks. Many of these tasks are performed in large …
Optimizing LLM queries in relational workloads
Analytical database providers (e.g., Redshift, Databricks, BigQuery) have rapidly added
support for invoking Large Language Models (LLMs) through native user-defined functions …
Compute or load KV cache? Why not both?
Recent advancements in Large Language Models (LLMs) have significantly increased
context window sizes, enabling sophisticated applications but also introducing substantial …
SCBench: A KV cache-centric analysis of long-context methods
Long-context LLMs have enabled numerous downstream applications but also introduced
significant challenges related to computational and memory efficiency. To address these …