MInference 1.0: Accelerating pre-filling for long-context LLMs via dynamic sparse attention

H Jiang, Y Li, C Zhang, Q Wu, X Luo, S Ahn… - arXiv preprint arXiv …, 2024 - arxiv.org
The computational challenges of Large Language Model (LLM) inference remain a
significant barrier to their widespread deployment, especially as prompt lengths continue to …

LOOK-M: Look-once optimization in KV cache for efficient multimodal long-context inference

Z Wan, Z Wu, C Liu, J Huang, Z Zhu, P Jin… - arXiv preprint arXiv …, 2024 - arxiv.org
Long-context Multimodal Large Language Models (MLLMs) demand substantial
computational resources for inference as the growth of their multimodal Key-Value (KV) …

A survey of Mamba

H Qu, L Ning, R An, W Fan, T Derr, H Liu, X Xu… - arXiv preprint arXiv …, 2024 - arxiv.org
As one of the most representative DL techniques, Transformer architecture has empowered
numerous advanced models, especially the large language models (LLMs) that comprise …

Get more with less: Synthesizing recurrence with KV cache compression for efficient LLM inference

H Dong, X Yang, Z Zhang, Z Wang, Y Chi… - arXiv preprint arXiv …, 2024 - arxiv.org
Many computational factors limit broader deployment of large language models. In this
paper, we focus on a memory bottleneck imposed by the key-value (KV) cache, a …
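
The KV-cache memory bottleneck named in this snippet (and in several others below) is easy to quantify. The sketch that follows is a back-of-the-envelope calculation of my own, not taken from the cited paper; the 7B-class configuration (32 layers, 32 KV heads of dimension 128, fp16) is an assumption chosen for illustration.

```python
# Illustrative estimate of KV cache size; the formula is the standard
# "2 (keys + values) x batch x tokens x layers x heads x head_dim x bytes".
def kv_cache_bytes(batch, seq_len, n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    return 2 * batch * seq_len * n_layers * n_kv_heads * head_dim * dtype_bytes

# Assumed 7B-class configuration in fp16 (assumption, not from the paper).
gib = kv_cache_bytes(batch=1, seq_len=128_000, n_layers=32,
                     n_kv_heads=32, head_dim=128) / 2**30
print(f"~{gib:.1f} GiB of KV cache for a single 128k-token prompt")
```

Under these assumptions a single 128k-token prompt already needs roughly 60 GiB of cache, which is why the methods surveyed here compress, evict, or otherwise shrink KV entries.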

KV cache compression, but what must we give in return? A comprehensive benchmark of long context capable approaches

J Yuan, H Liu, S Zhong, YN Chuang, S Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Long context capability is a crucial competency for large language models (LLMs) as it
mitigates the human struggle to digest long-form texts. This capability enables complex task …

Human-like episodic memory for infinite context LLMs

Z Fountas, MA Benfeghoul, A Oomerjee… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) have shown remarkable capabilities, but still struggle with
processing extensive contexts, limiting their ability to maintain coherence and accuracy over …

LazyLLM: Dynamic token pruning for efficient long context LLM inference

Q Fu, M Cho, T Merth, S Mehta, M Rastegari… - arXiv preprint arXiv …, 2024 - arxiv.org
The inference of transformer-based large language models consists of two sequential
stages: 1) a prefilling stage to compute the KV cache of prompts and generate the first token …
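
Since this snippet describes the two-stage prefill/decode pipeline, a minimal single-head sketch may help make the distinction concrete. Everything below (the dimensions, the random projections, and the toy shortcut of feeding the attention output back in as the next hidden state) is my own illustrative assumption, not code from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64  # assumed hidden size for this toy example
Wq, Wk, Wv = (rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
              for _ in range(3))

def attend(q, K, V):
    """Scaled dot-product attention of one query over the cached keys/values."""
    scores = q @ K.T / np.sqrt(d_model)        # (cache_len,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                         # (d_model,)

def prefill(prompt_hidden):
    """Stage 1: build the KV cache for every prompt token, emit the first output."""
    K, V = prompt_hidden @ Wk, prompt_hidden @ Wv
    return (K, V), attend(prompt_hidden[-1] @ Wq, K, V)

def decode_step(cache, hidden):
    """Stage 2: append one token's K/V to the cache, then attend over all of it."""
    K, V = cache
    K = np.vstack([K, hidden @ Wk])
    V = np.vstack([V, hidden @ Wv])
    return (K, V), attend(hidden @ Wq, K, V)

prompt = rng.standard_normal((128, d_model))   # 128 "prompt token" hidden states
cache, out = prefill(prompt)                   # prefill: cache length 128
for _ in range(4):                             # decode: cache grows by 1 per step
    cache, out = decode_step(cache, out)       # (toy: reuse output as next hidden)
print("KV cache length after decoding:", cache[0].shape[0])   # 132
```

The sketch only shows why prefill dominates long prompts (it processes all tokens at once) while decoding is bound by the ever-growing cache; token pruning methods like the one cited act on which entries reach that cache.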

D2O: Dynamic discriminative operations for efficient generative inference of large language models

Z Wan, X Wu, Y Zhang, Y Xin, C Tao, Z Zhu… - arXiv preprint arXiv …, 2024 - arxiv.org
Efficient inference in Large Language Models (LLMs) is impeded by the growing memory
demands of key-value (KV) caching, especially for longer sequences. Traditional KV cache …

A deeper look at depth pruning of LLMs

SA Siddiqui, X Dong, G Heinrich, T Breuel… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) are not only resource-intensive to train but even more
costly to deploy in production. Therefore, recent work has attempted to prune blocks of LLMs …
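
The depth-pruning idea in this last snippet, removing whole blocks of a model rather than individual weights, can be illustrated with a toy module list. The model, layer count, and pruning indices below are assumptions of mine for illustration only, not the paper's setup or its block-selection criterion.

```python
import torch
import torch.nn as nn

class ToyBlock(nn.Module):
    """Stand-in for one transformer block; the residual makes it skippable."""
    def __init__(self, d):
        super().__init__()
        self.ff = nn.Linear(d, d)
    def forward(self, x):
        return x + torch.relu(self.ff(x))

class ToyModel(nn.Module):
    def __init__(self, d=32, n_layers=8):
        super().__init__()
        self.layers = nn.ModuleList(ToyBlock(d) for _ in range(n_layers))
    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

def prune_depth(model, start, count):
    """Drop `count` consecutive blocks starting at index `start`, keep the rest."""
    kept = [layer for i, layer in enumerate(model.layers)
            if not (start <= i < start + count)]
    model.layers = nn.ModuleList(kept)
    return model

model = prune_depth(ToyModel(), start=4, count=2)            # 8 blocks -> 6 blocks
print(len(model.layers), model(torch.randn(1, 32)).shape)    # 6 torch.Size([1, 32])
```

Real methods differ in how they score which blocks to drop and whether they recover accuracy afterwards; the sketch only shows the structural operation itself.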