A fast post-training pruning framework for transformers
Pruning is an effective way to reduce the huge inference cost of Transformer models.
However, prior work on pruning Transformers requires retraining the models. This can add …
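The snippet cuts off before the method itself, but a common ingredient of retraining-free pruning is to score structures on a few calibration batches with a gradient-based (diagonal-Fisher-style) importance and then mask the lowest-scoring ones. Below is a minimal PyTorch sketch of that scoring step, with toy shapes and a stand-in for real attention-head outputs; everything here is an illustrative assumption, not the paper's actual pipeline.

```python
import torch

# Toy setup: score H attention-head gates by squared gradients (a
# diagonal-Fisher-style importance) on a calibration batch, then
# prune the lowest-scoring heads. Shapes and names are illustrative.
H, D, B = 8, 64, 16
torch.manual_seed(0)
head_outs = torch.randn(B, H, D)            # stand-in for per-head attention outputs
target = torch.randn(B, H * D)
proj = torch.nn.Linear(H * D, H * D)

gate = torch.ones(H, requires_grad=True)    # one gate per head, fixed at 1
y = proj((head_outs * gate[None, :, None]).reshape(B, -1))
loss = torch.nn.functional.mse_loss(y, target)
loss.backward()

importance = gate.grad.pow(2)               # squared gradient as importance proxy
n_prune = 2
pruned = importance.argsort()[:n_prune]     # heads with the least loss impact
mask = torch.ones(H)
mask[pruned] = 0.0                          # mask to apply at inference
print("pruned heads:", pruned.tolist())
```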
Accurate post training quantization with small calibration sets
Lately, post-training quantization methods have gained considerable attention, as they are
simple to use and require only a small unlabeled calibration set. This small dataset cannot …
Structural pruning via latency-saliency knapsack
Structural pruning can simplify network architecture and improve inference speed. We
propose Hardware-Aware Latency Pruning (HALP) that formulates structural pruning as a …
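The title indicates the selection step is posed as a knapsack: keep the channels whose summed saliency is largest while their summed latency cost stays within a budget. A minimal pure-Python sketch of such a 0/1 knapsack over channels, assuming latency costs pre-quantized to integer units; names and granularity are illustrative, not HALP's implementation.

```python
# 0/1 knapsack over prunable channels: keep the subset whose summed
# saliency is maximal while the summed latency cost stays within budget.
# Costs are assumed pre-quantized to integer "latency units".

def knapsack_keep_set(saliency, latency_cost, budget):
    """saliency[i], latency_cost[i]: importance and cost of keeping channel i."""
    n = len(saliency)
    # best[c] = (best total saliency, kept channel ids) with total cost <= c
    best = [(0.0, [])] * (budget + 1)
    for i in range(n):
        cost = latency_cost[i]
        # iterate budgets downward so each channel is used at most once
        for c in range(budget, cost - 1, -1):
            cand = best[c - cost][0] + saliency[i]
            if cand > best[c][0]:
                best[c] = (cand, best[c - cost][1] + [i])
    return best[budget][1]

# toy example: 6 channels, keep the most salient ones under 10 latency units
keep = knapsack_keep_set(
    saliency=[0.9, 0.1, 0.5, 0.7, 0.05, 0.3],
    latency_cost=[4, 2, 3, 4, 1, 2],
    budget=10,
)
print("channels to keep:", sorted(keep))
```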
Coaching a teachable student
We propose a novel knowledge distillation framework for effectively teaching a sensorimotor
student agent to drive from the supervision of a privileged teacher agent. Current distillation …
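The truncated abstract does not show the coaching mechanism, but frameworks like this build on the standard soft-label distillation objective: a KL term between temperature-softened teacher and student distributions, mixed with the ordinary task loss. A sketch of that baseline loss, not the paper's method:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Standard soft-label distillation: KL between temperature-softened
    teacher and student distributions, mixed with the ordinary task loss."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                      # rescale gradients to offset the temperature
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# toy usage
s = torch.randn(4, 10)               # student logits
t = torch.randn(4, 10)               # privileged-teacher logits
y = torch.randint(0, 10, (4,))
print(distillation_loss(s, t, y).item())
```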
Improving post training neural quantization: Layer-wise calibration and integer programming
Lately, post-training quantization methods have gained considerable attention, as they are
simple to use and require only a small unlabeled calibration set. This small dataset cannot …
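A minimal sketch of the layer-wise calibration idea the title names: pick each layer's quantization scale so the quantized layer reproduces the full-precision outputs on the small calibration set, rather than naively using max(|w|). The grid search and the symmetric per-tensor INT8 scheme below are illustrative assumptions, not the paper's algorithm (which also allocates bit-widths via integer programming).

```python
import numpy as np

def quantize(w, scale, n_bits=8):
    """Symmetric per-tensor quantization of weights, returned dequantized."""
    qmax = 2 ** (n_bits - 1) - 1
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

def calibrate_layer_scale(w, x_calib, n_grid=100):
    """Pick the scale minimizing MSE between full-precision and quantized
    layer outputs on the calibration batch."""
    y_ref = x_calib @ w                      # full-precision outputs
    max_scale = np.abs(w).max() / 127.0      # naive max-based scale
    best_scale, best_err = max_scale, np.inf
    for frac in np.linspace(0.2, 1.0, n_grid):
        scale = frac * max_scale
        err = ((x_calib @ quantize(w, scale) - y_ref) ** 2).mean()
        if err < best_err:
            best_scale, best_err = scale, err
    return best_scale

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))
x_calib = rng.normal(size=(32, 64))          # small unlabeled calibration set
print("calibrated scale:", calibrate_layer_scale(w, x_calib))
```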
FP-AGL: Filter pruning with adaptive gradient learning for accelerating deep convolutional neural networks
Filter pruning is a technique that reduces computational complexity, inference time, and
memory footprint by removing unnecessary filters in convolutional neural networks (CNNs) …
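The snippet stops before FP-AGL's adaptive-gradient criterion, so as context, here is the classic L1-norm filter-ranking baseline that such methods aim to improve on: score each filter by the L1 norm of its weights and zero out the weakest ones. An illustrative sketch only, not FP-AGL itself:

```python
import torch

# Baseline filter-pruning criterion: rank a conv layer's filters by L1 norm
# and drop the smallest ones.
conv = torch.nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3)
l1 = conv.weight.detach().abs().sum(dim=(1, 2, 3))   # one score per filter

prune_ratio = 0.25
n_prune = int(prune_ratio * conv.out_channels)
drop = l1.argsort()[:n_prune]                        # weakest filters

with torch.no_grad():
    conv.weight[drop] = 0.0                          # mask instead of rebuilding
    conv.bias[drop] = 0.0                            # the layer, for illustration
print(f"zeroed {n_prune} of {conv.out_channels} filters:", drop.tolist())
```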
SPDY: Accurate pruning with speedup guarantees
The recent focus on the efficiency of deep neural networks (DNNs) has led to significant
work on model compression approaches, of which weight pruning is one of the most …
HardCoRe-NAS: Hard constrained differentiable neural architecture search
Realistic use of neural networks often requires adhering to multiple constraints on latency,
energy, and memory, among others. A popular approach to find fitting networks is through …
Enhanced sparsification via stimulative training
Sparsification-based pruning has been an important category in model compression.
Existing methods commonly set sparsity-inducing penalty terms to suppress the importance …
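As context for the penalty terms the abstract mentions, the conventional recipe adds an L1 penalty on BatchNorm scale factors so that unimportant channels are driven toward zero and can be pruned afterwards (network-slimming style). A minimal sketch of that conventional setup, not this paper's stimulative-training method; the penalty strength is an assumed value.

```python
import torch

# Conventional sparsification: add an L1 penalty on BatchNorm scale factors
# so unimportant channels shrink toward zero and can later be pruned.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, 3, padding=1),
    torch.nn.BatchNorm2d(16),
    torch.nn.ReLU(),
)
lam = 1e-4  # sparsity strength (assumed value)

def sparsity_penalty(model):
    return sum(
        m.weight.abs().sum()
        for m in model.modules()
        if isinstance(m, torch.nn.BatchNorm2d)
    )

x = torch.randn(2, 3, 8, 8)
task_loss = model(x).mean()                 # stand-in for the real task loss
loss = task_loss + lam * sparsity_penalty(model)
loss.backward()
```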
What can we learn from the selective prediction and uncertainty estimation performance of 523 ImageNet classifiers
When deployed for risk-sensitive tasks, deep neural networks must include an uncertainty
estimation mechanism. Here we examine the relationship between deep architectures and …
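A concrete instance of the uncertainty mechanism the abstract refers to is softmax-response selective prediction: abstain whenever the top softmax probability falls below a threshold, then measure coverage and the error rate over the accepted examples. A small NumPy sketch; the thresholds and data are toy assumptions.

```python
import numpy as np

def selective_risk(probs, labels, threshold):
    """Softmax-response selective prediction: abstain when the top softmax
    probability falls below the threshold; return coverage and the error
    rate over accepted (non-abstained) examples."""
    conf = probs.max(axis=1)
    accept = conf >= threshold
    coverage = accept.mean()
    if coverage == 0:
        return coverage, 0.0
    pred = probs.argmax(axis=1)
    risk = (pred[accept] != labels[accept]).mean()
    return coverage, risk

rng = np.random.default_rng(0)
logits = rng.normal(size=(1000, 10))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
labels = rng.integers(0, 10, size=1000)
for t in (0.2, 0.3, 0.4):
    cov, risk = selective_risk(probs, labels, t)
    print(f"threshold={t:.1f}  coverage={cov:.2f}  selective risk={risk:.2f}")
```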