MInference 1.0: Accelerating pre-filling for long-context LLMs via dynamic sparse attention

H Jiang, Y Li, C Zhang, Q Wu, X Luo, S Ahn… - arXiv preprint arXiv …, 2024 - arxiv.org
The computational challenges of Large Language Model (LLM) inference remain a
significant barrier to their widespread deployment, especially as prompt lengths continue to …

LOOK-M: Look-once optimization in KV cache for efficient multimodal long-context inference

Z Wan, Z Wu, C Liu, J Huang, Z Zhu, P Jin… - arXiv preprint arXiv …, 2024 - arxiv.org
Long-context Multimodal Large Language Models (MLLMs) demand substantial
computational resources for inference as the growth of their multimodal Key-Value (KV) …

A survey of Mamba

H Qu, L Ning, R An, W Fan, T Derr, H Liu, X Xu… - arXiv preprint arXiv …, 2024 - arxiv.org
As one of the most representative DL techniques, Transformer architecture has empowered
numerous advanced models, especially the large language models (LLMs) that comprise …

Get more with less: Synthesizing recurrence with KV cache compression for efficient LLM inference

H Dong, X Yang, Z Zhang, Z Wang, Y Chi… - arXiv preprint arXiv …, 2024 - arxiv.org
Many computational factors limit broader deployment of large language models. In this
paper, we focus on a memory bottleneck imposed by the key-value (KV) cache, a …
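
The KV-cache memory bottleneck named in this snippet (and in several others below) is easy to quantify. The sketch that follows is a back-of-the-envelope calculation of my own, not taken from the cited paper; the 7B-class configuration (32 layers, 32 KV heads of dimension 128, fp16) is an assumption chosen for illustration.

```python
# Illustrative estimate of KV cache size; the formula is the standard
# "2 (keys + values) x batch x tokens x layers x heads x head_dim x bytes".
def kv_cache_bytes(batch, seq_len, n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    return 2 * batch * seq_len * n_layers * n_kv_heads * head_dim * dtype_bytes

# Assumed 7B-class configuration in fp16 (assumption, not from the paper).
gib = kv_cache_bytes(batch=1, seq_len=128_000, n_layers=32,
                     n_kv_heads=32, head_dim=128) / 2**30
print(f"~{gib:.1f} GiB of KV cache for a single 128k-token prompt")
```

Under these assumptions a single 128k-token prompt already needs roughly 60 GiB of cache, which is why the methods surveyed here compress, evict, or otherwise shrink KV entries.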

KV cache compression, but what must we give in return? A comprehensive benchmark of long context capable approaches

J Yuan, H Liu, S Zhong, YN Chuang, S Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Long context capability is a crucial competency for large language models (LLMs) as it
mitigates the human struggle to digest long-form texts. This capability enables complex task …

Human-like episodic memory for infinite context LLMs

Z Fountas, MA Benfeghoul, A Oomerjee… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) have shown remarkable capabilities, but still struggle with
processing extensive contexts, limiting their ability to maintain coherence and accuracy over …

LazyLLM: Dynamic token pruning for efficient long context LLM inference

Q Fu, M Cho, T Merth, S Mehta, M Rastegari… - arXiv preprint arXiv …, 2024 - arxiv.org
The inference of transformer-based large language models consists of two sequential
stages: 1) a prefilling stage to compute the KV cache of prompts and generate the first token …
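
Since this snippet describes the two-stage prefill/decode pipeline, a minimal single-head sketch may help make the distinction concrete. Everything below (the dimensions, the random projections, and the toy shortcut of feeding the attention output back in as the next hidden state) is my own illustrative assumption, not code from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64  # assumed hidden size for this toy example
Wq, Wk, Wv = (rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
              for _ in range(3))

def attend(q, K, V):
    """Scaled dot-product attention of one query over the cached keys/values."""
    scores = q @ K.T / np.sqrt(d_model)        # (cache_len,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                         # (d_model,)

def prefill(prompt_hidden):
    """Stage 1: build the KV cache for every prompt token, emit the first output."""
    K, V = prompt_hidden @ Wk, prompt_hidden @ Wv
    return (K, V), attend(prompt_hidden[-1] @ Wq, K, V)

def decode_step(cache, hidden):
    """Stage 2: append one token's K/V to the cache, then attend over all of it."""
    K, V = cache
    K = np.vstack([K, hidden @ Wk])
    V = np.vstack([V, hidden @ Wv])
    return (K, V), attend(hidden @ Wq, K, V)

prompt = rng.standard_normal((128, d_model))   # 128 "prompt token" hidden states
cache, out = prefill(prompt)                   # prefill: cache length 128
for _ in range(4):                             # decode: cache grows by 1 per step
    cache, out = decode_step(cache, out)       # (toy: reuse output as next hidden)
print("KV cache length after decoding:", cache[0].shape[0])   # 132
```

The sketch only shows why prefill dominates long prompts (it processes all tokens at once) while decoding is bound by the ever-growing cache; token pruning methods like the one cited act on which entries reach that cache.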

D2O: Dynamic discriminative operations for efficient generative inference of large language models

Z Wan, X Wu, Y Zhang, Y Xin, C Tao, Z Zhu… - arXiv preprint arXiv …, 2024 - arxiv.org
Efficient inference in Large Language Models (LLMs) is impeded by the growing memory
demands of key-value (KV) caching, especially for longer sequences. Traditional KV cache …

A deeper look at depth pruning of LLMs

SA Siddiqui, X Dong, G Heinrich, T Breuel… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) are not only resource-intensive to train but even more
costly to deploy in production. Therefore, recent work has attempted to prune blocks of LLMs …
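
The depth-pruning idea in this last snippet, removing whole blocks of a model rather than individual weights, can be illustrated with a toy module list. The model, layer count, and pruning indices below are assumptions of mine for illustration only, not the paper's setup or its block-selection criterion.

```python
import torch
import torch.nn as nn

class ToyBlock(nn.Module):
    """Stand-in for one transformer block; the residual makes it skippable."""
    def __init__(self, d):
        super().__init__()
        self.ff = nn.Linear(d, d)
    def forward(self, x):
        return x + torch.relu(self.ff(x))

class ToyModel(nn.Module):
    def __init__(self, d=32, n_layers=8):
        super().__init__()
        self.layers = nn.ModuleList(ToyBlock(d) for _ in range(n_layers))
    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

def prune_depth(model, start, count):
    """Drop `count` consecutive blocks starting at index `start`, keep the rest."""
    kept = [layer for i, layer in enumerate(model.layers)
            if not (start <= i < start + count)]
    model.layers = nn.ModuleList(kept)
    return model

model = prune_depth(ToyModel(), start=4, count=2)            # 8 blocks -> 6 blocks
print(len(model.layers), model(torch.randn(1, 32)).shape)    # 6 torch.Size([1, 32])
```

Real methods differ in how they score which blocks to drop and whether they recover accuracy afterwards; the sketch only shows the structural operation itself.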