MInference 1.0: Accelerating pre-filling for long-context LLMs via dynamic sparse attention

H Jiang, Y Li, C Zhang, Q Wu, X Luo, S Ahn… - arXiv preprint arXiv …, 2024 - arxiv.org
The computational challenges of Large Language Model (LLM) inference remain a
significant barrier to their widespread deployment, especially as prompt lengths continue to …

EAGLE-2: Faster inference of language models with dynamic draft trees

Y Li, F Wei, C Zhang, H Zhang - arXiv preprint arXiv:2406.16858, 2024 - arxiv.org
Inference with modern Large Language Models (LLMs) is expensive and time-consuming,
and speculative sampling has proven to be an effective solution. Most speculative sampling …
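The snippet names speculative sampling as the acceleration strategy. As a rough orientation only, here is a minimal sketch of the generic draft-and-verify loop (not EAGLE-2's dynamic draft trees); the toy stand-in distributions, vocabulary size, and parameter names are illustrative assumptions, not anything from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 50  # toy vocabulary size (illustrative)

def toy_dist(ctx, temperature):
    """Stand-in for a model's next-token distribution; deterministic in the
    context so repeated calls on the same prefix agree."""
    seed = hash(tuple(ctx)) % (2**32)
    logits = np.random.default_rng(seed).normal(size=VOCAB) / temperature
    e = np.exp(logits - logits.max())
    return e / e.sum()

def draft_dist(ctx):
    return toy_dist(ctx, temperature=1.5)   # cheap, smoother "draft" model

def target_dist(ctx):
    return toy_dist(ctx, temperature=1.0)   # expensive "target" model

def speculative_step(ctx, gamma=4):
    """One draft-and-verify round: propose gamma tokens with the draft model,
    then accept or reject each against the target model's probabilities."""
    proposals, q_probs, work = [], [], list(ctx)
    for _ in range(gamma):
        q = draft_dist(work)
        tok = int(rng.choice(VOCAB, p=q))
        proposals.append(tok)
        q_probs.append(q)
        work.append(tok)

    out = list(ctx)
    for tok, q in zip(proposals, q_probs):
        p = target_dist(out)
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)                      # draft token verified
        else:
            residual = np.maximum(p - q, 0.0)    # resample from the leftover mass
            if residual.sum() == 0:
                residual = p
            out.append(int(rng.choice(VOCAB, p=residual / residual.sum())))
            return out                           # stop at the first rejection
    # all gamma drafts accepted: take one bonus token from the target model
    out.append(int(rng.choice(VOCAB, p=target_dist(out))))
    return out

print(speculative_step([1, 2, 3]))
```

The min(1, p/q) acceptance rule with residual resampling is what keeps the output distribution identical to the target model's; draft-tree methods such as EAGLE-2 primarily change how the proposals are generated and organized.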

Multi-layer transformers gradient can be approximated in almost linear time

Y Liang, Z Sha, Z Shi, Z Song, Y Zhou - arXiv preprint arXiv:2408.13233, 2024 - arxiv.org
The computational complexity of the self-attention mechanism in popular transformer
architectures poses significant challenges for training and inference, and becomes the …

A tighter complexity analysis of SparseGPT

X Li, Y Liang, Z Shi, Z Song - arXiv preprint arXiv:2408.12151, 2024 - arxiv.org
In this work, we improved the analysis of the running time of SparseGPT [Frantar, Alistarh
ICML 2023] from $O(d^{3})$ to $O(d^{\omega} + d^{2+a+o(1)} + d^{1+\omega(1,1,a)-a})$ …
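For readability, the running-time bound quoted in the snippet, typeset as a single expression. Here $\omega$ is the square matrix-multiplication exponent and $\omega(1,1,a)$ the corresponding rectangular exponent; reading $a$ as a tunable parameter trading off the last two terms is our interpretation of the truncated snippet:

```latex
% Bound quoted in the snippet (typeset for readability).
% \omega: square matrix-multiplication exponent;
% \omega(1,1,a): exponent for multiplying a d x d matrix by a d x d^a matrix.
\[
  O\!\left(d^{3}\right)
  \;\longrightarrow\;
  O\!\left(d^{\omega} + d^{\,2+a+o(1)} + d^{\,1+\omega(1,1,a)-a}\right)
\]
```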

MagicPIG: LSH sampling for efficient LLM generation

Z Chen, R Sadhukhan, Z Ye, Y Zhou, J Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) with long context windows have gained significant attention.
However, the KV cache, stored to avoid re-computation, becomes a bottleneck. Various …
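The title names LSH sampling over the KV cache. As a generic illustration of the locality-sensitive-hashing idea only (not MagicPIG's actual sampling estimator), a random-hyperplane SimHash sketch that buckets cached keys so a query scores only the colliding ones; the dimensions and variable names are assumptions for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
D, N_KEYS, N_BITS = 64, 1000, 8   # head dim, cached keys, hash bits (illustrative)

# Random-hyperplane (SimHash) LSH: vectors with high cosine similarity tend
# to land in the same bucket, so a query touches only a small subset of keys.
hyperplanes = rng.normal(size=(N_BITS, D))

def simhash(x):
    """Sign pattern of x against the random hyperplanes, packed into an int."""
    bits = (hyperplanes @ x > 0).astype(int)
    return int("".join(map(str, bits)), 2)

# Index the cached keys once.
keys = rng.normal(size=(N_KEYS, D))
buckets = {}
for i, k in enumerate(keys):
    buckets.setdefault(simhash(k), []).append(i)

# At decode time, hash the query and score only the colliding keys
# instead of attending over the full cache.
query = rng.normal(size=D)
candidates = buckets.get(simhash(query), [])
scores = keys[candidates] @ query if candidates else np.array([])
print(f"{len(candidates)} of {N_KEYS} keys touched")
```

In practice, multiple hash tables and bit widths trade recall against the number of keys touched per query.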

ShadowKV: KV cache in shadows for high-throughput long-context LLM inference

H Sun, LW Chang, W Bao, S Zheng, N Zheng… - arXiv preprint arXiv …, 2024 - arxiv.org
With the widespread deployment of long-context large language models (LLMs), there has
been a growing demand for efficient support of high-throughput inference. However, as the …

Recycled Attention: Efficient inference for long-context language models

F Xu, T Goyal, E Choi - arXiv preprint arXiv:2411.05787, 2024 - arxiv.org
Generating long sequences of tokens given a long-context input imposes a heavy
computational burden for large language models (LLMs). One of the computational …

A theoretical perspective for speculative decoding algorithm

M Yin, M Chen, K Huang, M Wang - arXiv preprint arXiv:2411.00841, 2024 - arxiv.org
Transformer-based autoregressive sampling has been the major bottleneck for slowing
down large language model inferences. One effective way to accelerate inference is …

SCBench: A KV cache-centric analysis of long-context methods

Y Li, H Jiang, Q Wu, X Luo, S Ahn, C Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Long-context LLMs have enabled numerous downstream applications but also introduced
significant challenges related to computational and memory efficiency. To address these …

Seed: Accelerating reasoning tree construction via scheduled speculative decoding

Z Wang, J Wu, Y Lai, C Zhang, D Zhou - arXiv preprint arXiv:2406.18200, 2024 - arxiv.org
Large Language Models (LLMs) demonstrate remarkable emergent abilities across various
tasks, yet fall short of complex reasoning and planning tasks. The tree-search-based …