RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval

D Liu, M Chen, B Lu, H Jiang, Z Han, Q Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Transformer-based Large Language Models (LLMs) have become increasingly important.
However, due to the quadratic time complexity of attention computation, scaling LLMs to …

MagicPIG: LSH Sampling for Efficient LLM Generation

Z Chen, R Sadhukhan, Z Ye, Y Zhou, J Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) with long context windows have gained significant attention.
However, the KV cache, stored to avoid re-computation, becomes a bottleneck. Various …

ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference

H Sun, LW Chang, W Bao, S Zheng, N Zheng… - arXiv preprint arXiv …, 2024 - arxiv.org
With the widespread deployment of long-context large language models (LLMs), there has
been a growing demand for efficient support of high-throughput inference. However, as the …

Squeezed Attention: Accelerating Long Context Length LLM Inference

C Hooper, S Kim, H Mohammadzadeh… - arXiv preprint arXiv …, 2024 - arxiv.org
Emerging Large Language Model (LLM) applications require long input prompts to perform
complex downstream tasks like document analysis and code generation. For these long …

A Survey on Large Language Model Acceleration Based on KV Cache Management

H Li, Y Li, A Tian, T Tang, Z Xu, X Chen, N Hu… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) have revolutionized a wide range of domains such as
natural language processing, computer vision, and multi-modal tasks due to their ability to …

AdaptLink: A Heterogeneity-Aware Adaptive Framework for Distributed MLLM Inference

X Hu, Z Chen, K Guo, M Zhang… - AAAI 2025 Workshop on …, 2025 - openreview.net
Multimodal Large Language Models (MLLMs) have demonstrated exceptional performance
in tasks such as commonsense reasoning and visual scene understanding. Despite their …

Data Proportion Detection for Optimized Data Management for Large Language Models

H Liang, K Zhao, Y Yang, B Cui, G Dong… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) have demonstrated exceptional performance across a wide
range of tasks and domains, with data preparation playing a critical role in achieving these …

KVTuner: Sensitivity-Aware Layer-wise Mixed Precision KV Cache Quantization for Efficient and Nearly Lossless LLM Inference

X Li, Z Xing, Y Li, L Qu, HL Zhen, W Liu, Y Yao… - arXiv preprint arXiv …, 2025 - arxiv.org
KV cache quantization can improve Large Language Model (LLM) inference throughput
and latency in long-context and large-batch scenarios while preserving LLMs …

AttentionPredictor: Temporal Pattern Matters for Efficient LLM Inference

Q Yang, J Wang, X Li, Z Wang, C Chen, L Chen… - arXiv preprint arXiv …, 2025 - arxiv.org
With the development of large language models (LLMs), efficient inference through Key-
Value (KV) cache compression has attracted considerable attention, especially for long …

XKV: Personalized KV Cache Memory Reduction for Long-Context LLM Inference

W Li, Z Wang, Y Gu, G Yu - arXiv preprint arXiv:2412.05896, 2024 - arxiv.org
Recently, the generative Large Language Model (LLM) has achieved remarkable success in
numerous applications. Notably, its inference generates output tokens one by one, leading …