Model tells you where to merge: Adaptive KV cache merging for LLMs on long-context tasks

Z Wang, B Jin, Z Yu, M Zhang - arXiv preprint arXiv:2407.08454, 2024 - arxiv.org
How to efficiently serve Large Language Models (LLMs) has become a pressing issue
because of the huge computational cost of their autoregressive generation process. To …
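
The computational cost referenced here comes from recomputing attention over the entire prefix at every decoding step; a KV cache avoids that by storing each token's keys and values once. Below is a minimal single-head NumPy sketch of one cached decoding step; the function and variable names are illustrative assumptions, not this paper's merging method.

    import numpy as np

    def decode_step(x_t, W_q, W_k, W_v, k_cache, v_cache):
        # One autoregressive step with a KV cache: only the new token's
        # key and value are computed; the prefix's entries are reused.
        q = x_t @ W_q
        k_cache.append(x_t @ W_k)        # cache grows by one entry per step
        v_cache.append(x_t @ W_v)
        K = np.stack(k_cache)            # (t, d) keys for all tokens so far
        V = np.stack(v_cache)            # (t, d) values for all tokens so far
        scores = K @ q / np.sqrt(q.shape[-1])
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()         # softmax over cached positions
        return weights @ V               # attention output for the new token

The cache itself grows linearly with context length, which is the memory pressure that the merging, selection, and compression methods in this listing target.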

InMu-Net: Advancing Multi-modal Intent Detection via Information Bottleneck and Multi-sensory Processing

Z Zhu, X Cheng, Z Chen, Y Chen, Y Zhang… - Proceedings of the …, 2024 - dl.acm.org
Multi-modal intent detection (MID), which has received widespread attention in dialogue
systems, aims to comprehend users' intentions through diverse modalities. Despite the …

Famba-V: Fast Vision Mamba with cross-layer token fusion

H Shen, Z Wan, X Wang, M Zhang - arXiv preprint arXiv:2409.09808, 2024 - arxiv.org
Mamba and Vision Mamba (Vim) models have shown their potential as an alternative to
methods based on Transformer architecture. This work introduces Fast Mamba for Vision …
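
Token fusion in general reduces sequence length by merging highly similar token representations so that later layers process fewer tokens. The sketch below shows that generic idea for adjacent near-duplicate tokens; the cosine threshold and averaging rule are assumptions for illustration, not Famba-V's actual cross-layer scheme.

    import numpy as np

    def fuse_similar_tokens(tokens, threshold=0.9):
        # tokens: (n, d) array of token embeddings; returns a shorter (m, d) array.
        # Adjacent tokens whose embeddings are nearly parallel are averaged into one.
        fused = [tokens[0]]
        for t in tokens[1:]:
            prev = fused[-1]
            cos = t @ prev / (np.linalg.norm(t) * np.linalg.norm(prev) + 1e-8)
            if cos > threshold:
                fused[-1] = (prev + t) / 2   # merge the pair into a single token
            else:
                fused.append(t)
        return np.stack(fused)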

TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection

W Wu, Z Pan, C Wang, L Chen, Y Bai, K Fu… - arXiv preprint arXiv …, 2024 - arxiv.org
With the development of large language models (LLMs), the ability to handle longer contexts
has become a key capability for Web applications such as cross-document understanding …
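
Token-level KV cache selection generally means scoring the cached entries against the current query and attending only over the most relevant ones. A generic top-k illustration follows; the dot-product scoring and fixed budget k are simplifying assumptions, not TokenSelect's actual selection criterion.

    import numpy as np

    def select_kv(q, K, V, k=256):
        # q: (d,) current query; K, V: (n, d) cached keys and values.
        # Keep only the k cached positions whose keys score highest against q.
        scores = K @ q
        keep = np.argsort(scores)[-min(k, len(scores)):]
        return K[keep], V[keep]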

A survey on large language model acceleration based on KV cache management

H Li, Y Li, A Tian, T Tang, Z Xu, X Chen, N Hu… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) have revolutionized a wide range of domains such as
natural language processing, computer vision, and multi-modal tasks due to their ability to …

UNComp: Uncertainty-Aware Long-Context Compressor for Efficient Large Language Model Inference

J Xiong, J Shen, F Ye, C Tao, Z Wan, J Lu, X Wu… - arXiv preprint arXiv …, 2024 - arxiv.org
Deploying large language models (LLMs) is challenging due to their high memory and
computational demands, especially during long-context inference. While key-value (KV) …

ZigZagKV: Dynamic KV Cache Compression for Long-context Modeling based on Layer Uncertainty

M Zhong, X Liu, C Zhang, Y Lei, Y Gao, Y Hu… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) have become a research hotspot. To accelerate the
inference of LLMs, storing computed caches in memory has become the standard technique …
