MiniCache: KV cache compression in depth dimension for large language models
A critical approach for efficiently deploying computationally demanding large language
models (LLMs) is Key-Value (KV) caching. The KV cache stores key-value states of …
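A minimal sketch of the mechanism this entry builds on, assuming toy dimensions and stand-in projections: a per-layer KV cache that grows by one row per decode step, plus a naive adjacent-layer merge to gesture at MiniCache's depth-dimension compression (the paper's actual merge is a more careful interpolation with outlier-token retention, not a plain average).

```python
import numpy as np

D = 64  # model width; toy size for illustration

def attend(q, K, V):
    # Scaled dot-product attention of one query over all cached positions.
    s = q @ K.T / np.sqrt(D)
    w = np.exp(s - s.max())
    w /= w.sum()
    return w @ V

def decode_step(x, kv_cache):
    # One (K, V) buffer per layer: each decode step appends a single row
    # instead of recomputing keys/values for the whole prefix.
    h = x
    for cache in kv_cache:
        k, v = h, h  # stand-ins for the real K/V projections
        cache["K"] = np.vstack([cache["K"], k])
        cache["V"] = np.vstack([cache["V"], v])
        h = attend(h, cache["K"], cache["V"])
    return h

def merge_adjacent(kv_cache):
    # Depth-dimension compression in the MiniCache spirit: adjacent layers'
    # KV states are similar enough that a pair can share one merged buffer,
    # roughly halving cache memory. The average here is our placeholder rule.
    return [{"K": (a["K"] + b["K"]) / 2, "V": (a["V"] + b["V"]) / 2}
            for a, b in zip(kv_cache[::2], kv_cache[1::2])]

kv = [{"K": np.empty((0, D)), "V": np.empty((0, D))} for _ in range(4)]
for _ in range(3):
    decode_step(np.random.randn(D), kv)
small = merge_adjacent(kv)  # 4 layer buffers -> 2 shared buffers
```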
The unreasonable ineffectiveness of the deeper layers
We empirically study a simple layer-pruning strategy for popular families of open-weight
pretrained LLMs, finding minimal degradation of performance on different question …
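A sketch of the layer-pruning recipe the abstract studies, under assumptions: layers are plain callables, and the block to drop is chosen by the angular distance between representations before and after it, matching the paper's selection heuristic; the subsequent "healing" finetune step is omitted.

```python
import numpy as np

def angular_distance(h_in, h_out):
    # Blocks whose outputs barely rotate the representation are candidates
    # for removal.
    cos = h_in @ h_out / (np.linalg.norm(h_in) * np.linalg.norm(h_out))
    return np.arccos(np.clip(cos, -1.0, 1.0)) / np.pi

def best_block_to_drop(layers, x, n_drop):
    # Run once, record hidden states, and pick the contiguous block whose
    # removal perturbs the representation least.
    states = [x]
    for layer in layers:
        states.append(layer(states[-1]))
    dists = [angular_distance(states[i], states[i + n_drop])
             for i in range(len(layers) - n_drop + 1)]
    return int(np.argmin(dists))

def prune(layers, start, n_drop):
    # Drop layers[start : start + n_drop]; per the paper's empirical study,
    # the cheapest blocks to drop tend to sit in the deeper half.
    return layers[:start] + layers[start + n_drop:]

layers = [lambda h, s=s: h + 0.01 * s * h for s in range(8)]  # toy "layers"
x = np.random.randn(16)
start = best_block_to_drop(layers, x, n_drop=2)
pruned = prune(layers, start, n_drop=2)
```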
Large language model inference acceleration: A comprehensive hardware perspective
Large Language Models (LLMs) have demonstrated remarkable capabilities across various
fields, from natural language understanding to text generation. Compared to non-generative …
EAGLE-2: Faster inference of language models with dynamic draft trees
Inference with modern Large Language Models (LLMs) is expensive and time-consuming,
and speculative sampling has proven to be an effective solution. Most speculative sampling …
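For context, a sketch of plain greedy speculative decoding, the loop EAGLE-2 accelerates; the paper's contribution, a context-dependent dynamic draft tree, is not shown. `draft_next` and `target_next` are hypothetical next-token functions for a small draft model and the large target model.

```python
def speculative_decode(prefix, draft_next, target_next, k=4, new_tokens=32):
    out = list(prefix)
    goal = len(prefix) + new_tokens
    while len(out) < goal:
        # 1) The cheap draft model guesses k tokens ahead.
        ctx, guesses = list(out), []
        for _ in range(k):
            t = draft_next(ctx)
            guesses.append(t)
            ctx.append(t)
        # 2) The target model scores the guessed positions (in practice one
        #    batched forward pass); matching prefixes are accepted for free,
        #    and its own token is emitted at the first mismatch.
        for g in guesses:
            t = target_next(out)
            out.append(t)
            if t != g or len(out) >= goal:
                break  # discard the remaining draft tokens
    return out[:goal]
```

When the draft model is well aligned with the target, several tokens are accepted per expensive target pass, which is where the speedup comes from.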
Knowledge circuits in pretrained transformers
The remarkable capabilities of modern large language models are rooted in their vast
repositories of knowledge encoded within their parameters, enabling them to perceive the …
Multi-layer transformers gradient can be approximated in almost linear time
The computational complexity of the self-attention mechanism in popular transformer
architectures poses significant challenges for training and inference, and becomes the …
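To make the bottleneck concrete, a toy view of what the paper's almost-linear approximation sidesteps: naive attention materializes an n-by-n score matrix, so the forward pass and the gradients through it both cost on the order of n^2 time and memory.

```python
import numpy as np

n, d = 1024, 64
Q, K, V = (np.random.randn(n, d) for _ in range(3))
S = Q @ K.T / np.sqrt(d)                      # (n, n) scores: the bottleneck
P = np.exp(S - S.max(axis=1, keepdims=True))  # row-wise softmax
P /= P.sum(axis=1, keepdims=True)
out = P @ V                                   # backprop also touches all n^2 entries
```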
A tighter complexity analysis of SparseGPT
In this work, we improved the analysis of the running time of SparseGPT [Frantar, Alistarh
ICML 2023] from $O(d^{3})$ to $O(d^{\omega} + d^{2+a+o(1)} + d^{1+\omega(1,1,a)-a})$ …
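A gloss on reading the bound, in standard fast matrix multiplication notation (our annotation, not the paper's text):

```latex
% \omega < 2.372 is the square matrix-multiplication exponent, and
% \omega(1,1,a) the rectangular one: multiplying an n x n matrix by an
% n x n^a matrix takes n^{\omega(1,1,a)+o(1)} time.
\[
  O\!\left(d^{\omega} + d^{2+a+o(1)} + d^{1+\omega(1,1,a)-a}\right),
  \qquad a \in [0,1].
\]
% Choosing a to balance the last two terms yields a running time strictly
% below the previous O(d^{3}).
```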
VideoLLM-MoD: Efficient video-language streaming with mixture-of-depths vision computation
A well-known dilemma in large vision-language models (e.g., GPT-4, LLaVA) is that while
increasing the number of vision tokens generally enhances visual understanding, it also …
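A sketch of the mixture-of-depths routing idea the title refers to, with illustrative names rather than VideoLLM-MoD's actual modules: a learned router keeps only the top-k tokens for the full block computation and lets the rest skip through the residual path, so most vision tokens cost almost nothing at that layer.

```python
import numpy as np

def mod_layer(x, block, router_w, keep_ratio=0.25):
    # x: (tokens, dim). The router scores each token; only the top-k tokens
    # pay for the full block, everyone else rides the residual path.
    scores = x @ router_w
    k = max(1, int(len(x) * keep_ratio))
    keep = np.argsort(scores)[-k:]
    out = x.copy()                          # skipped tokens pass unchanged
    out[keep] = x[keep] + block(x[keep])    # only k tokens pay compute
    return out

dim = 32
x = np.random.randn(256, dim)               # e.g. 256 vision tokens
w = np.random.randn(dim)
y = mod_layer(x, block=lambda h: 0.1 * h, router_w=w)
```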
LazyLLM: Dynamic token pruning for efficient long context LLM inference
The inference of transformer-based large language models consists of two sequential
stages: 1) a prefilling stage to compute the KV cache of prompts and generate the first token …
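The two stages named in the abstract, sketched with a LazyLLM-flavored twist: prefill only the prompt tokens scored important and defer the rest. The `model.*` methods and the `importance` scorer are hypothetical stand-ins, not the paper's API; the actual method scores tokens with attention from earlier layers and can revive deferred tokens in later steps.

```python
def generate(model, prompt_ids, max_new=16, keep_ratio=0.5):
    scores = model.importance(prompt_ids)          # placeholder token scoring
    k = max(1, int(len(prompt_ids) * keep_ratio))
    kept = sorted(sorted(range(len(prompt_ids)),   # top-k by score,
                         key=lambda i: scores[i])[-k:])  # restored to order

    # Stage 1: prefill computes the KV cache of the (pruned) prompt and the
    # first output token in one pass.
    kv, tok = model.prefill([prompt_ids[i] for i in kept])

    # Stage 2: decode extends the sequence one token at a time, appending to
    # the cache rather than recomputing it.
    out = [tok]
    for _ in range(max_new - 1):
        tok, kv = model.decode_step(tok, kv)
        out.append(tok)
    return out
```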
Challenges in deploying long-context transformers: A theoretical peak performance analysis
Y Fu - arXiv preprint arXiv:2405.08944, 2024 - arxiv.org
Transformer-based long context generative models power emerging AI applications like
hour-long video understanding and project-level coding agents. Deploying long context …
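Back-of-envelope KV-cache sizing, the kind of arithmetic such a peak-performance analysis starts from; the model shape below is a hypothetical example (grouped-query attention, fp16), not a configuration taken from the paper.

```python
layers, kv_heads, head_dim = 32, 8, 128
seq_len, batch, bytes_per_elem = 131_072, 1, 2   # 128K context, fp16

# 2x for keys and values; each layer stores kv_heads * head_dim per token.
kv_bytes = 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem
print(f"KV cache: {kv_bytes / 2**30:.1f} GiB per sequence")  # -> 16.0 GiB
```

At this scale the cache, not the weights, can dominate memory and bandwidth, which is the kind of deployment bottleneck such an analysis quantifies.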