KVSharer: Efficient Inference via Layer-Wise Dissimilar KV Cache Sharing
The development of large language models (LLMs) has significantly expanded model sizes,
resulting in substantial GPU memory requirements during inference. The key and value …
EfficientASR: Speech Recognition Network Compression via Attention Redundancy and Chunk-Level FFN Optimization
In recent years, Transformer networks have shown remarkable performance in speech
recognition tasks. However, their deployment poses challenges due to high computational …
UniForm: A Reuse Attention Mechanism Optimized for Efficient Vision Transformers on Edge Devices
Transformer-based architectures have demonstrated remarkable success across various
domains, but their deployment on edge devices remains challenging due to high memory …