MInference 1.0: Accelerating pre-filling for long-context LLMs via dynamic sparse attention
The computational challenges of Large Language Model (LLM) inference remain a
significant barrier to their widespread deployment, especially as prompt lengths continue to …
EAGLE-2: Faster inference of language models with dynamic draft trees
Inference with modern Large Language Models (LLMs) is expensive and time-consuming,
and speculative sampling has proven to be an effective solution. Most speculative sampling …
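As a brief note on the technique these entries build on (vanilla speculative sampling, not EAGLE-2's specific draft-tree variant): a cheap draft model proposes a token $x$ from its distribution $q$, and the target model, with distribution $p$, accepts it with probability $\min(1, p(x)/q(x))$; on rejection, a replacement token is drawn from the residual distribution proportional to $\max(0, p(\cdot)-q(\cdot))$, which preserves the target model's output distribution exactly.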
Multi-layer transformers gradient can be approximated in almost linear time
The computational complexity of the self-attention mechanism in popular transformer
architectures poses significant challenges for training and inference, and becomes the …
A tighter complexity analysis of SparseGPT
In this work, we improved the analysis of the running time of SparseGPT [Frantar, Alistarh
ICML 2023] from $O(d^{3})$ to $O(d^{\omega} + d^{2+a+o(1)} + d^{1+\omega(1,1,a)-a})$ …
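In standard fast-matrix-multiplication notation, $\omega < 2.373$ denotes the exponent of square matrix multiplication, $\omega(1,1,a)$ denotes the exponent of multiplying a $d \times d$ matrix by a $d \times d^{a}$ matrix, and $a$ is a parameter of the analysis.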
MagicPIG: LSH sampling for efficient LLM generation
Large language models (LLMs) with long context windows have gained significant attention.
However, the KV cache, stored to avoid re-computation, becomes a bottleneck. Various …
ShadowKV: KV cache in shadows for high-throughput long-context LLM inference
With the widespread deployment of long-context large language models (LLMs), there has
been a growing demand for efficient support of high-throughput inference. However, as the …
Recycled Attention: Efficient inference for long-context language models
Generating long sequences of tokens given a long-context input imposes a heavy
computational burden for large language models (LLMs). One of the computational …
A theoretical perspective for speculative decoding algorithm
Transformer-based autoregressive sampling has been the major bottleneck slowing
down large language model inference. One effective way to accelerate inference is …
SCBench: A KV cache-centric analysis of long-context methods
Long-context LLMs have enabled numerous downstream applications but also introduced
significant challenges related to computational and memory efficiency. To address these …
SEED: Accelerating reasoning tree construction via scheduled speculative decoding
Large Language Models (LLMs) demonstrate remarkable emergent abilities across various
tasks, yet fall short of complex reasoning and planning tasks. The tree-search-based …