Model Tells You Where to Merge: Adaptive KV Cache Merging for LLMs on Long-Context Tasks
How to efficiently serve Large Language Models (LLMs) has become a pressing issue
because of the huge computational cost of their autoregressive generation process. To …
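To make the KV-merging idea above concrete, here is a minimal, hypothetical Python sketch: it simply averages the keys and values of neighbouring cached tokens wherever a supplied mask allows it. The mask, the averaging rule, and the `merge_adjacent_kv` helper are illustrative assumptions, not the adaptive merging policy proposed in the paper.

```python
import numpy as np

def merge_adjacent_kv(keys, values, merge_mask):
    """Toy illustration of KV cache merging (not the paper's algorithm).

    keys, values: [seq_len, head_dim] cached tensors for one attention head.
    merge_mask:   [seq_len - 1] booleans; True at position i means entries i
                  and i + 1 may be merged (deciding where this is safe is the
                  adaptive part a real method has to learn or estimate).
    """
    merged_k, merged_v = [], []
    i = 0
    while i < len(keys):
        if i < len(keys) - 1 and merge_mask[i]:
            # Average two neighbouring entries into one, halving their memory.
            merged_k.append((keys[i] + keys[i + 1]) / 2)
            merged_v.append((values[i] + values[i + 1]) / 2)
            i += 2
        else:
            merged_k.append(keys[i])
            merged_v.append(values[i])
            i += 1
    return np.stack(merged_k), np.stack(merged_v)
```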
InMu-Net: Advancing Multi-modal Intent Detection via Information Bottleneck and Multi-sensory Processing
Multi-modal intent detection (MID) aims to comprehend users' intentions through diverse
modalities, a task that has received widespread attention in dialogue systems. Despite the …
Famba-V: Fast Vision Mamba with cross-layer token fusion
Mamba and Vision Mamba (Vim) models have shown their potential as an alternative to
methods based on the Transformer architecture. This work introduces Fast Mamba for Vision …
TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection
With the development of large language models (LLMs), the ability to handle longer contexts
has become a key capability for Web applications such as cross-document understanding …
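As a rough illustration of token-level KV cache selection, the sketch below keeps only the top-k cached tokens ranked by their dot product with the current query. The scoring rule, the per-head view, and the `select_kv_tokens` helper are assumptions made for illustration; TokenSelect's actual selection criterion may differ.

```python
import numpy as np

def select_kv_tokens(query, keys, values, budget):
    """Toy illustration of token-level KV cache selection (a generic top-k
    rule, not necessarily TokenSelect's criterion).

    query:  [head_dim] query of the current decoding step for one head.
    keys:   [seq_len, head_dim] cached keys.
    values: [seq_len, head_dim] cached values.
    budget: number of cached tokens to keep for this attention computation.
    """
    scores = keys @ query                 # relevance of each cached token
    keep = np.argsort(scores)[-budget:]   # indices of the `budget` highest scores
    keep.sort()                           # restore original token order
    return keys[keep], values[keep]
```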
A survey on large language model acceleration based on KV cache management
H Li, Y Li, A Tian, T Tang, Z Xu, X Chen, N Hu… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) have revolutionized a wide range of domains such as
natural language processing, computer vision, and multi-modal tasks due to their ability to …
UNComp: Uncertainty-Aware Long-Context Compressor for Efficient Large Language Model Inference
Deploying large language models (LLMs) is challenging due to their high memory and
computational demands, especially during long-context inference. While key-value (KV) …
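A hedged sketch of what an uncertainty-aware budget split across layers could look like: layers whose attention distributions are estimated to be more uncertain (here, higher entropy) are given a larger share of the total KV budget. The entropy proxy, the proportional split, and the `allocate_cache_budget` helper are assumptions, not UNComp's actual estimator.

```python
import numpy as np

def allocate_cache_budget(attn_entropies, total_budget, floor=16):
    """Toy illustration of an uncertainty-aware KV budget split (an assumed
    entropy heuristic, not UNComp's estimator).

    attn_entropies: per-layer attention-entropy estimates; higher entropy is
                    treated here as higher uncertainty, so that layer keeps more.
    total_budget:   total number of KV entries to keep across all layers.
    floor:          minimum number of entries any single layer keeps.
    """
    e = np.asarray(attn_entropies, dtype=float)
    shares = e / e.sum()                                    # proportional split
    budgets = np.maximum(floor, np.round(shares * total_budget)).astype(int)
    return budgets                                          # rounding may shift the exact total
```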
ZigZagKV: Dynamic KV Cache Compression for Long-context Modeling based on Layer Uncertainty
Large Language Models (LLMs) have become a research hotspot. To accelerate the
inference of LLMs, storing computed key-value (KV) caches in memory has become the standard technique …
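For reference, the "standard technique" this abstract refers to is KV caching during autoregressive decoding: key and value projections of past tokens are stored so that each new step only projects the newest token. A minimal single-head sketch, assuming NumPy and toy projection matrices:

```python
import numpy as np

def attend_with_kv_cache(x_t, W_q, W_k, W_v, cache):
    """Minimal single-head sketch of standard KV caching in decoding.

    x_t:   [d_model] hidden state of the newest token.
    cache: dict with lists 'k' and 'v' holding past key/value projections;
           storing these is what saves recomputation at every step.
    """
    q = x_t @ W_q
    cache['k'].append(x_t @ W_k)
    cache['v'].append(x_t @ W_v)
    K = np.stack(cache['k'])                     # [t, d_head]
    V = np.stack(cache['v'])
    scores = K @ q / np.sqrt(q.shape[-1])        # scaled dot-product scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                     # softmax over past tokens
    return weights @ V                           # attention output for this step
```

Calling this once per generated token grows the cache by one entry per step, which is exactly the memory cost that the compression, merging, and selection papers listed above aim to reduce.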