MiniCache: KV cache compression in depth dimension for large language models

A Liu, J Liu, Z Pan, Y He, R Haffari… - Advances in Neural …, 2025 - proceedings.neurips.cc
A critical approach for efficiently deploying computationally demanding large language
models (LLMs) is Key-Value (KV) caching. The KV cache stores key-value states of …
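
As a reading aid, here is a minimal sketch of what "storing key-value states" means in practice: a cache that grows by one K/V entry per generated token and is reused at every decoding step. The class name `DecoderCache`, the single-head NumPy setup, and the toy dimensions are illustrative assumptions, not code from the MiniCache paper.

```python
# Minimal KV-cache sketch: single attention head, plain NumPy.
import numpy as np

class DecoderCache:
    """Stores past key/value states so each new token can attend to them
    without recomputing projections for the whole prefix."""

    def __init__(self, head_dim: int):
        self.keys = np.empty((0, head_dim))
        self.values = np.empty((0, head_dim))

    def append(self, k: np.ndarray, v: np.ndarray) -> None:
        self.keys = np.vstack([self.keys, k[None, :]])
        self.values = np.vstack([self.values, v[None, :]])

    def attend(self, q: np.ndarray) -> np.ndarray:
        # Scaled dot-product attention over all cached positions.
        scores = self.keys @ q / np.sqrt(q.shape[-1])
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ self.values

head_dim = 8
cache = DecoderCache(head_dim)
rng = np.random.default_rng(0)
for step in range(4):                      # pretend we decode 4 tokens
    k, v, q = rng.normal(size=(3, head_dim))
    cache.append(k, v)                     # cache grows with sequence length
    out = cache.attend(q)                  # reuses all previously cached K/V
print(cache.keys.shape)                    # (4, 8): one K entry per token
```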

The unreasonable ineffectiveness of the deeper layers

A Gromov, K Tirumala, H Shapourian… - arXiv preprint arXiv …, 2024 - arxiv.org
We empirically study a simple layer-pruning strategy for popular families of open-weight
pretrained LLMs, finding minimal degradation of performance on different question …
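
The sketch below illustrates the simplest form of the layer-pruning idea the snippet describes: remove a contiguous block of deep layers while keeping the final one. The helper name and the toy 12-layer stack are illustrative assumptions, not the authors' exact procedure.

```python
# Drop a block of deep layers from a toy stack of transformer-like blocks.
import torch
from torch import nn

def prune_deeper_layers(layers: nn.ModuleList, n_drop: int) -> nn.ModuleList:
    """Remove n_drop layers ending just before the final layer,
    a rough stand-in for dropping a contiguous block of deep layers."""
    keep = list(layers[: len(layers) - n_drop - 1]) + [layers[-1]]
    return nn.ModuleList(keep)

blocks = nn.ModuleList([nn.Linear(16, 16) for _ in range(12)])  # toy "layers"
pruned = prune_deeper_layers(blocks, n_drop=4)
print(len(blocks), "->", len(pruned))  # 12 -> 8
```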

Large language model inference acceleration: A comprehensive hardware perspective

J Li, J Xu, S Huang, Y Chen, W Li, J Liu, Y Lian… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) have demonstrated remarkable capabilities across various
fields, from natural language understanding to text generation. Compared to non-generative …

EAGLE-2: Faster inference of language models with dynamic draft trees

Y Li, F Wei, C Zhang, H Zhang - arXiv preprint arXiv:2406.16858, 2024 - arxiv.org
Inference with modern Large Language Models (LLMs) is expensive and time-consuming,
and speculative sampling has proven to be an effective solution. Most speculative sampling …
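
The sketch below shows the generic draft-and-verify loop behind speculative sampling in its simplest greedy form; the function names, the per-token verification calls, and the toy models are illustrative assumptions, and EAGLE-2's dynamic draft trees are not reproduced here.

```python
# Greedy draft-and-verify sketch of speculative decoding.
from typing import Callable, List

def speculative_step(prefix: List[int],
                     draft_next: Callable[[List[int]], int],
                     target_next: Callable[[List[int]], int],
                     k: int = 4) -> List[int]:
    """Let a cheap draft model propose k tokens, then keep the longest
    prefix the target model agrees with (greedy acceptance)."""
    proposal, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        proposal.append(t)
        ctx.append(t)
    # In practice the target verifies all k drafted positions with a single
    # forward pass; sequential calls here are only for clarity.
    accepted, ctx = [], list(prefix)
    for t in proposal:
        if target_next(ctx) == t:              # target agrees with the draft
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(target_next(ctx))  # correct the mismatch and stop
            break
    return prefix + accepted

# Toy models: both predict "previous token + 1", so every draft is accepted.
draft = target = lambda ctx: ctx[-1] + 1
print(speculative_step([0], draft, target))    # [0, 1, 2, 3, 4]
```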

Knowledge circuits in pretrained transformers

Y Yao, N Zhang, Z Xi, M Wang, Z Xu… - Advances in Neural …, 2025 - proceedings.neurips.cc
The remarkable capabilities of modern large language models are rooted in their vast
repositories of knowledge encoded within their parameters, enabling them to perceive the …

Multi-layer transformers gradient can be approximated in almost linear time

Y Liang, Z Sha, Z Shi, Z Song, Y Zhou - arXiv preprint arXiv:2408.13233, 2024 - arxiv.org
The computational complexity of the self-attention mechanism in popular transformer
architectures poses significant challenges for training and inference, and becomes the …
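
The cost the snippet refers to comes from the n-by-n score matrix in exact attention; the short NumPy sketch below shows where that quadratic term appears. It illustrates only the baseline cost, not the almost-linear gradient approximation the paper develops.

```python
# Exact self-attention: the (n, n) score matrix is the n^2 bottleneck.
import numpy as np

def attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    scores = Q @ K.T / np.sqrt(Q.shape[-1])        # (n, n) matrix: the n^2 term
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

n, d = 1024, 64
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, n, d))
out = attention(Q, K, V)
print(out.shape, "score entries:", n * n)          # (1024, 64) score entries: 1048576
```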

A tighter complexity analysis of sparsegpt

X Li, Y Liang, Z Shi, Z Song - arXiv preprint arXiv:2408.12151, 2024 - arxiv.org
In this work, we improved the analysis of the running time of SparseGPT [Frantar, Alistarh
ICML 2023] from $O(d^{3})$ to $O(d^{\omega} + d^{2+a+o(1)} + d^{1+\omega(1,1,a)-a})$ …
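
For readers unfamiliar with the notation, the bound can be read as below, using the conventional fast-matrix-multiplication symbols; the interpretation of the exponents and the numeric estimate of $\omega$ follow standard usage rather than figures stated in this snippet.

```latex
% \omega        : exponent of n x n matrix multiplication, currently \omega < 2.372
% \omega(1,1,a) : exponent for multiplying an n x n matrix by an n x n^{a} matrix
\[
  O\!\left(d^{3}\right)
  \;\longrightarrow\;
  O\!\left(d^{\omega} + d^{2+a+o(1)} + d^{1+\omega(1,1,a)-a}\right),
  \qquad a \in [0, 1].
\]
```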

VideoLLM-MoD: Efficient video-language streaming with mixture-of-depths vision computation

S Wu, J Chen, KQ Lin, Q Wang, Y Gao… - Advances in …, 2025 - proceedings.neurips.cc
A well-known dilemma in large vision-language models (e.g., GPT-4, LLaVA) is that while
increasing the number of vision tokens generally enhances visual understanding, it also …
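
A hedged sketch of the generic mixture-of-depths idea the title refers to: a learned router lets only a capacity-limited subset of tokens through the expensive block at a layer, while the rest take the residual shortcut. The module below (names, capacity value, toy block) is a generic illustration, not VideoLLM-MoD's vision-specific design.

```python
# Generic mixture-of-depths routing: only top-k tokens run the block.
import torch
from torch import nn

class MoDLayer(nn.Module):
    def __init__(self, dim: int, capacity: float = 0.25):
        super().__init__()
        self.router = nn.Linear(dim, 1)          # per-token routing score
        self.block = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.capacity = capacity                 # fraction of tokens that compute

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim)
        scores = self.router(x).squeeze(-1)                  # (batch, tokens)
        k = max(1, int(self.capacity * x.shape[1]))
        top = scores.topk(k, dim=1).indices                  # routed token indices
        out = x.clone()                                      # non-routed tokens skip the block
        for b in range(x.shape[0]):
            sel = top[b]
            out[b, sel] = x[b, sel] + self.block(x[b, sel])  # compute only top-k tokens
        return out

layer = MoDLayer(dim=32)
x = torch.randn(2, 100, 32)
print(layer(x).shape)   # torch.Size([2, 100, 32]); only 25 tokens per sample ran the block
```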

LazyLLM: Dynamic token pruning for efficient long context LLM inference

Q Fu, M Cho, T Merth, S Mehta, M Rastegari… - arXiv preprint arXiv …, 2024 - arxiv.org
The inference of transformer-based large language models consists of two sequential
stages: 1) a prefilling stage to compute the KV cache of prompts and generate the first token …
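
Below is a minimal sketch of those two stages: a prefill pass builds one KV entry per prompt token, then a decode loop grows the cache one token at a time. All names and the scalar stand-in for K/V tensors are illustrative; LazyLLM's token-pruning policy is not shown.

```python
# Two-stage inference sketch: prefill builds the KV cache, decode extends it.
from typing import Callable, List, Tuple

Cache = List[Tuple[float, float]]   # stand-in for per-position (K, V) tensors

def prefill(prompt: List[int], kv_of: Callable[[int], Tuple[float, float]]) -> Cache:
    # One pass over the full prompt: every prompt token contributes a KV entry.
    return [kv_of(tok) for tok in prompt]

def decode(cache: Cache, first: int, steps: int,
           next_token: Callable[[Cache, int], int],
           kv_of: Callable[[int], Tuple[float, float]]) -> List[int]:
    out, tok = [first], first
    for _ in range(steps):
        tok = next_token(cache, tok)   # attends to everything cached so far
        cache.append(kv_of(tok))       # cache grows by one entry per step
        out.append(tok)
    return out

kv_of = lambda t: (float(t), float(t) * 0.5)
next_token = lambda cache, t: t + 1
cache = prefill([10, 11, 12], kv_of)   # prefill stage
print(decode(cache, first=13, steps=3, next_token=next_token, kv_of=kv_of))
# [13, 14, 15, 16]; len(cache) == 6 after decoding
```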

Challenges in deploying long-context transformers: A theoretical peak performance analysis

Y Fu - arXiv preprint arXiv:2405.08944, 2024 - arxiv.org
Transformer-based long context generative models power emerging AI applications like
hour-long video understanding and project-level coding agents. Deploying long context …
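
The kind of back-of-the-envelope arithmetic such a peak-performance analysis rests on is easy to reproduce: the KV cache alone for a long prompt quickly dominates memory. The model shape below is a generic 70B-class configuration with grouped-query attention chosen purely for illustration; it is not a figure from the paper.

```python
# KV-cache memory for one long sequence, fp16, grouped-query attention.
n_layers   = 80          # decoder layers
n_kv_heads = 8           # KV heads
head_dim   = 128
seq_len    = 100_000     # a long context
bytes_per  = 2           # fp16/bf16

kv_bytes = 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per  # K and V
print(f"{kv_bytes / 2**30:.1f} GiB of KV cache for one {seq_len}-token sequence")
# ~30.5 GiB, before weights, activations, or batching
```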