RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval

D Liu, M Chen, B Lu, H Jiang, Z Han, Q Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Transformer-based Large Language Models (LLMs) have become increasingly important.
However, due to the quadratic time complexity of attention computation, scaling LLMs to …

MagicPIG: LSH Sampling for Efficient LLM Generation

Z Chen, R Sadhukhan, Z Ye, Y Zhou, J Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) with long context windows have gained significant attention.
However, the KV cache, stored to avoid re-computation, becomes a bottleneck. Various …

ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference

H Sun, LW Chang, W Bao, S Zheng, N Zheng… - arXiv preprint arXiv …, 2024 - arxiv.org
With the widespread deployment of long-context large language models (LLMs), there has
been a growing demand for efficient support of high-throughput inference. However, as the …

Squeezed Attention: Accelerating Long Context Length LLM Inference

C Hooper, S Kim, H Mohammadzadeh… - arXiv preprint arXiv …, 2024 - arxiv.org
Emerging Large Language Model (LLM) applications require long input prompts to perform
complex downstream tasks like document analysis and code generation. For these long …

A Survey on Large Language Model Acceleration Based on KV Cache Management

H Li, Y Li, A Tian, T Tang, Z Xu, X Chen, N Hu… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) have revolutionized a wide range of domains such as
natural language processing, computer vision, and multi-modal tasks due to their ability to …

AdaptLink: A Heterogeneity-Aware Adaptive Framework for Distributed MLLM Inference

X Hu, Z Chen, K Guo, M Zhang… - AAAI 2025 Workshop on …, 2025 - openreview.net
Multimodal Large Language Models (MLLMs) have demonstrated exceptional performance
in tasks such as commonsense reasoning and visual scene understanding. Despite their …

Data Proportion Detection for Optimized Data Management for Large Language Models

H Liang, K Zhao, Y Yang, B Cui, G Dong… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) have demonstrated exceptional performance across a wide
range of tasks and domains, with data preparation playing a critical role in achieving these …

KVTuner: Sensitivity-Aware Layer-wise Mixed Precision KV Cache Quantization for Efficient and Nearly Lossless LLM Inference

X Li, Z Xing, Y Li, L Qu, HL Zhen, W Liu, Y Yao… - arXiv preprint arXiv …, 2025 - arxiv.org
KV cache quantization can improve Large Language Model (LLM) inference throughput
and latency in long-context and large-batch scenarios while preserving LLMs …

AttentionPredictor: Temporal Pattern Matters for Efficient LLM Inference

Q Yang, J Wang, X Li, Z Wang, C Chen, L Chen… - arXiv preprint arXiv …, 2025 - arxiv.org
With the development of large language models (LLMs), efficient inference through Key-
Value (KV) cache compression has attracted considerable attention, especially for long …

XKV: Personalized KV Cache Memory Reduction for Long-Context LLM Inference

W Li, Z Wang, Y Gu, G Yu - arXiv preprint arXiv:2412.05896, 2024 - arxiv.org
Recently, the generative Large Language Model (LLM) has achieved remarkable success in
numerous applications. Notably, its inference generates output tokens one by one, leading …