A survey on deep neural network pruning: Taxonomy, comparison, analysis, and recommendations
Modern deep neural networks, particularly recent large language models, come with
massive model sizes that require significant computational and storage resources. To …
A survey of techniques for optimizing transformer inference
Recent years have seen a phenomenal rise in the performance and applications of
transformer neural networks. The family of transformer networks, including Bidirectional …
Spvit: Enabling faster vision transformers via latency-aware soft token pruning
Abstract Recently, Vision Transformer (ViT) has continuously established new milestones in
the computer vision field, while the high computation and memory cost makes its …
Green ai: Do deep learning frameworks have different costs?
The use of Artificial Intelligence (ai), and more specifically of Deep Learning (dl), in modern
software systems, is nowadays widespread and continues to grow. At the same time, its …
Compressing large-scale transformer-based models: A case study on bert
Pre-trained Transformer-based models have achieved state-of-the-art performance for
various Natural Language Processing (NLP) tasks. However, these models often have …
Pruning self-attentions into convolutional layers in single path
Vision Transformers (ViTs) have achieved impressive performance over various computer
vision tasks. However, modeling global correlations with multi-head self-attention (MSA) …
A survey of deep learning on cpus: opportunities and co-optimizations
CPU is a powerful, pervasive, and indispensable platform for running deep learning (DL)
workloads in systems ranging from mobile to extreme-end servers. In this article, we present …
Accelerating framework of transformer by hardware design and model compression co-optimization
P Qi, EHM Sha, Q Zhuge, H Peng… - 2021 IEEE/ACM …, 2021 - ieeexplore.ieee.org
State-of-the-art Transformer-based models, with gigantic parameters, are difficult to
accommodate on resource-constrained embedded devices. Moreover, with the …
Gradient-free structured pruning with unlabeled data
Abstract Large Language Models (LLMs) have achieved great success in solving difficult
tasks across many domains, but such success comes with a high computation cost, and …
Joint structured pruning and dense knowledge distillation for efficient transformer model compression
In this paper, we develop a novel Joint Model Compression (referred to as JMC) method by
combining structured pruning and dense knowledge distillation techniques to significantly …