KVSharer: Efficient Inference via Layer-Wise Dissimilar KV Cache Sharing

Y Yang, Z Cao, Q Chen, L Qin, D Yang, H Zhao… - arXiv preprint arXiv…, 2024 - arxiv.org
The development of large language models (LLMs) has significantly expanded model sizes,
resulting in substantial GPU memory requirements during inference. The key and value …
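Only the title and the opening of the abstract are visible here, so the following is a rough, non-authoritative sketch of what layer-wise dissimilar KV cache sharing might look like: on a calibration pass, pair up the layers whose KV caches are most dissimilar and let one layer of each pair reuse the other's cache, cutting the number of caches kept in GPU memory. All names (`build_sharing_map`, `layer_kv`, `num_shared`) and the cosine-similarity criterion are assumptions for illustration, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def build_sharing_map(layer_kv, num_shared):
    """Greedily pair layers whose calibration KV caches are most
    dissimilar; the higher-indexed layer of each pair reuses the
    lower-indexed layer's cache at inference time.

    layer_kv: list of per-layer KV tensors from one calibration pass
              (hypothetical; any consistent shape per layer works).
    Returns: dict mapping {reusing_layer: source_layer}.
    """
    n = len(layer_kv)
    # One normalized vector per layer, so dot products give cosine similarity.
    flat = torch.stack([kv.flatten() for kv in layer_kv])
    flat = F.normalize(flat, dim=1)
    sim = flat @ flat.T
    # Candidate pairs, most dissimilar first.
    pairs = sorted((sim[i, j].item(), i, j)
                   for i in range(n) for j in range(i + 1, n))
    share_map, used = {}, set()
    for _, i, j in pairs:
        if len(share_map) == num_shared:
            break
        if i in used or j in used:
            continue
        share_map[j] = i  # layer j drops its own cache and reads layer i's
        used.update((i, j))
    return share_map
```

With such a map, the decoder would allocate caches only for layers that are not keys of `share_map`, trading a small accuracy cost for proportionally lower KV memory.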

EfficientASR: Speech Recognition Network Compression via Attention Redundancy and Chunk-Level FFN Optimization

J Wang, Z Liang, X Zhang, N Cheng, J Xiao - arXiv preprint arXiv…, 2024 - arxiv.org
In recent years, Transformer networks have shown remarkable performance in speech
recognition tasks. However, their deployment poses challenges due to high computational …
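The snippet cuts off before describing the method, so the sketch below illustrates only the generic idea behind a "chunk-level FFN": applying the position-wise feed-forward network to fixed-size chunks of the sequence so peak activation memory stays bounded. The class name `ChunkedFFN`, the `chunk_size` knob, and all dimensions are hypothetical and may differ from the paper's actual optimization.

```python
import torch
import torch.nn as nn

class ChunkedFFN(nn.Module):
    """Position-wise FFN applied chunk by chunk along the time axis.
    Because the FFN is position-wise, splitting the sequence does not
    change the output, only the peak activation footprint."""
    def __init__(self, d_model=256, d_ff=1024, chunk_size=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.chunk_size = chunk_size

    def forward(self, x):  # x: [batch, time, d_model]
        outs = [self.net(chunk) for chunk in x.split(self.chunk_size, dim=1)]
        return torch.cat(outs, dim=1)
```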

UniForm: A Reuse Attention Mechanism Optimized for Efficient Vision Transformers on Edge Devices

SK Yeom, TH Kim - arXiv preprint arXiv:2412.02344, 2024 - arxiv.org
Transformer-based architectures have demonstrated remarkable success across various
domains, but their deployment on edge devices remains challenging due to high memory …
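As a hedged illustration of a "reuse attention" mechanism of the kind the title names: compute a single attention map per layer and let every head reuse it, so the Q/K projections and softmax run once instead of per head, which suits memory-limited edge devices. The class `ReuseAttention` and its shapes are assumptions for this sketch, not the paper's definition.

```python
import torch
import torch.nn as nn

class ReuseAttention(nn.Module):
    """One shared attention map per layer, reused across all heads.
    Values are still split per head, so head diversity comes from the
    value/output projections rather than per-head attention maps."""
    def __init__(self, dim=192, heads=3, qk_dim=64):
        super().__init__()
        self.heads, self.dh = heads, dim // heads
        self.q = nn.Linear(dim, qk_dim)
        self.k = nn.Linear(dim, qk_dim)
        self.v = nn.Linear(dim, dim)
        self.proj = nn.Linear(dim, dim)
        self.scale = qk_dim ** -0.5

    def forward(self, x):  # x: [batch, tokens, dim]
        b, n, _ = x.shape
        attn = (self.q(x) @ self.k(x).transpose(1, 2)) * self.scale
        attn = attn.softmax(dim=-1)              # [b, n, n], computed once
        v = self.v(x).view(b, n, self.heads, self.dh).transpose(1, 2)
        out = attn.unsqueeze(1) @ v              # broadcast map over heads
        return self.proj(out.transpose(1, 2).reshape(b, n, -1))
```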