A survey of techniques for optimizing transformer inference
Recent years have seen a phenomenal rise in the performance and applications of
transformer neural networks. The family of transformer networks, including Bidirectional …
Towards unified deep image deraining: A survey and a new benchmark
Recent years have witnessed significant advances in image deraining due to a variety of
effective image priors and deep learning models. As each deraining approach has …
LLMLingua: Compressing prompts for accelerated inference of large language models
Large language models (LLMs) have been applied in various applications due to their
astonishing capabilities. With advancements in technologies such as chain-of-thought (CoT) …
Evo-ViT: Slow-fast token evolution for dynamic vision transformer
Vision transformers (ViTs) have recently received explosive popularity, but the huge
computational cost is still a severe issue. Since the computation complexity of ViT is …
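To make the token-reduction idea behind entries like this one concrete, the following is a minimal Python/PyTorch sketch of attention-guided token pruning in a vision transformer: patch tokens are ranked by the attention the [CLS] token pays to them and only the top fraction is kept, which shrinks the quadratic self-attention cost of later layers. The function name, shapes, and keep ratio are illustrative assumptions, and this is not the slow-fast token evolution scheme of Evo-ViT itself.

    import torch

    def prune_tokens(tokens, cls_attn, keep_ratio=0.5):
        """Keep the most informative patch tokens, ranked by how much attention
        the [CLS] token pays to each patch (a common importance proxy).

        tokens:   (B, 1 + N, D) -- [CLS] token followed by N patch tokens
        cls_attn: (B, N)        -- attention weights from [CLS] to each patch
        """
        B, n_plus_1, D = tokens.shape
        n_patches = n_plus_1 - 1
        n_keep = max(1, int(n_patches * keep_ratio))

        # Indices of the top-k most attended patches, per example in the batch.
        keep_idx = cls_attn.topk(n_keep, dim=1).indices          # (B, n_keep)
        keep_idx = keep_idx.unsqueeze(-1).expand(-1, -1, D)      # (B, n_keep, D)

        cls_tok = tokens[:, :1]                                  # (B, 1, D)
        patches = tokens[:, 1:]                                  # (B, N, D)
        kept = patches.gather(1, keep_idx)                       # (B, n_keep, D)

        # Later attention layers now run on 1 + n_keep tokens instead of 1 + N,
        # reducing the quadratic self-attention cost accordingly.
        return torch.cat([cls_tok, kept], dim=1)

    # Toy usage: 196 patch tokens pruned to 98 before the remaining blocks.
    x = torch.randn(2, 197, 384)
    attn = torch.rand(2, 196)
    print(prune_tokens(x, attn, keep_ratio=0.5).shape)  # torch.Size([2, 99, 384])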
Less is more: Focus attention for efficient DETR
DETR-like models have significantly boosted the performance of detectors and even
outperformed classical convolutional models. However, all tokens are treated equally …
The optimal BERT surgeon: Scalable and accurate second-order pruning for large language models
Transformer-based language models have become a key building block for natural
language processing. While these models are extremely accurate, they can be too large and …
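As a rough illustration of what "second-order pruning" refers to, the sketch below scores weights with an Optimal-Brain-Surgeon-style saliency w_i^2 / (2 * [H^-1]_ii), using a diagonal empirical-Fisher approximation of the Hessian. All names and shapes here are hypothetical, and the scalable blocked solver described in the paper is far more involved than this toy version.

    import torch

    def second_order_prune_mask(weights, grad_samples, sparsity=0.5, damp=1e-4):
        """Rank weights by an OBS-style saliency s_i = w_i^2 * H_ii / 2, where the
        Hessian diagonal is approximated by the mean of squared per-sample
        gradients (empirical Fisher). Low-saliency weights are cheapest to prune.

        weights:      (P,) flattened parameter vector
        grad_samples: (S, P) per-sample gradients of the loss w.r.t. the weights
        """
        h_diag = grad_samples.pow(2).mean(dim=0) + damp   # (P,) diagonal Hessian approx.
        saliency = weights.pow(2) * h_diag / 2            # estimated loss increase if pruned
        n_prune = int(sparsity * weights.numel())
        prune_idx = saliency.topk(n_prune, largest=False).indices
        mask = torch.ones_like(weights, dtype=torch.bool)
        mask[prune_idx] = False                           # False => weight is zeroed out
        return mask

    # Toy usage: prune half of a 10k-parameter layer.
    w = torch.randn(10_000)
    g = torch.randn(32, 10_000)
    mask = second_order_prune_mask(w, g)
    print(mask.float().mean())  # roughly 0.5 of the weights survive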
Full stack optimization of transformer inference: a survey
Recent advances in state-of-the-art DNN architecture design have been moving toward
Transformer models. These models achieve superior accuracy across a wide range of …
SparseViT: Revisiting activation sparsity for efficient high-resolution vision transformer
High-resolution images enable neural networks to learn richer visual representations.
However, this improved performance comes at the cost of growing computational …
Model tells you what to discard: Adaptive KV cache compression for LLMs
In this study, we introduce adaptive KV cache compression, a plug-and-play method that
reduces the memory footprint of generative inference for Large Language Models (LLMs) …
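For readers unfamiliar with KV cache compression in general, a minimal sketch follows: it evicts cached key/value entries, keeping a recent window plus the most-attended older positions. This is a generic heuristic shown only for illustration; the adaptive, per-head eviction policies of the cited method are not reproduced here, and all tensor shapes and thresholds are assumptions.

    import torch

    def compress_kv_cache(keys, values, attn_scores, recent=64, keep_extra=192):
        """Generic KV-cache eviction: always keep the most recent `recent`
        positions, plus the `keep_extra` older positions that have received the
        most cumulative attention so far.

        keys, values: (T, num_heads, head_dim) cached tensors for one sequence
        attn_scores:  (T,) cumulative attention mass each cached position received
        """
        T = keys.shape[0]
        if T <= recent + keep_extra:
            return keys, values

        old_scores = attn_scores[: T - recent]
        top_old = old_scores.topk(keep_extra).indices.sort().values  # keep original order
        recent_idx = torch.arange(T - recent, T)
        keep = torch.cat([top_old, recent_idx])

        return keys[keep], values[keep]

    # Toy usage: a 1024-entry cache shrinks to 256 entries.
    k = torch.randn(1024, 8, 64)
    v = torch.randn(1024, 8, 64)
    scores = torch.rand(1024)
    k2, v2 = compress_kv_cache(k, v, scores)
    print(k2.shape)  # torch.Size([256, 8, 64])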
Dynamic context pruning for efficient and interpretable autoregressive transformers
Autoregressive Transformers adopted in Large Language Models (LLMs) are hard
to scale to long sequences. Despite several works trying to reduce their computational cost …