Unifying KV Cache Compression for Large Language Models with LeanKV

Y Zhang, Y Hu, R Zhao, J Lui, H Chen - arXiv preprint arXiv:2412.03131, 2024 - arxiv.org
Large language models (LLMs) demonstrate exceptional performance but incur high serving
costs due to substantial memory demands, with the key-value (KV) cache being a primary …

SeerAttention: Learning intrinsic sparse attention in your LLMs

Y Gao, Z Zeng, D Du, S Cao, HKH So, T Cao… - arXiv preprint arXiv …, 2024 - arxiv.org
Attention is the cornerstone of modern Large Language Models (LLMs). Yet its quadratic
complexity limits the efficiency and scalability of LLMs, especially for those with a long …

GRAPHMOE: Amplifying Cognitive Depth of Mixture-of-Experts Network via Introducing Self-Rethinking Mechanism

C Tang, B Lv, Z Zheng, B Yang, K Zhao, N Liao… - arXiv preprint arXiv …, 2025 - arxiv.org
Traditional Mixture-of-Experts (MoE) networks benefit from utilizing multiple smaller expert
models as opposed to a single large network. However, these experts typically operate …

The Future of AI: Exploring the Potential of Large Concept Models

H Ahmad, D Goel - arXiv preprint arXiv:2501.05487, 2025 - arxiv.org
The field of Artificial Intelligence (AI) continues to drive transformative innovations, with
significant progress in conversational interfaces, autonomous vehicles, and intelligent …

Attention heads of large language models

Z Zheng, Y Wang, Y Huang, S Song, M Yang, B Tang… - Patterns - cell.com
Large language models (LLMs) have demonstrated performance approaching human levels
in tasks such as long-text comprehension and mathematical reasoning, but they remain …

Workshop on Sparsity in LLMs (SLLM): Deep Dive into Mixture of Experts, Quantization, Hardware, and Inference

T Chen, U Evci, Y Ioannou, B Isik, S Liu… - ICLR 2025 Workshop … - openreview.net
Large Language Models (LLMs) have emerged as transformative tools in both research and
industry, excelling across a wide array of tasks. However, their growing computational …

Tutorial Proposal: Efficient Inference for Large Language Models – Algorithm, Model, and System

X Ning, G Dai, H Bai, L Hou, Y Wang, Q Liu - nics-effalg.com
Background. Large Language Models (LLMs) have attracted significant attention from both
academia and industry in recent years. They are revolutionizing many applications …