Microscaling data formats for deep learning
BD Rouhani, R Zhao, A More, M Hall… - arXiv preprint, 2023
Optimal clipping and magnitude-aware differentiation for improved quantization-aware training
Data clipping is crucial in reducing noise in quantization operations and improving the
achievable accuracy of quantization-aware training (QAT). Current practices rely on …
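The trade-off this snippet alludes to, a tighter clip threshold shrinks the quantization step (less rounding noise for inliers) at the cost of clipping error on outliers, can be illustrated with a minimal fake-quantizer as used in QAT forward passes. This is a generic sketch, not the paper's method; `clipped_quantize` and its parameters are invented for illustration:

```python
import numpy as np

def clipped_quantize(x, clip, bits=8):
    """Clip-then-quantize sketch: values are clipped to [-clip, clip],
    then uniformly quantized to a signed integer grid and mapped back
    to floats ("fake quantization"), as in a QAT forward pass."""
    qmax = 2 ** (bits - 1) - 1        # e.g. 7 for signed 4-bit
    scale = clip / qmax               # smaller clip -> finer step size
    q = np.round(np.clip(x, -clip, clip) / scale)
    return q * scale

# Outliers saturate at +/- clip; inliers see rounding error <= scale / 2.
y = clipped_quantize(np.array([-2.0, -0.4, 0.1, 3.0]), clip=1.0, bits=4)
```

Choosing `clip` (by heuristic, search, or gradient-based learning) is exactly the knob that clipping strategies for quantization-aware training tune.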
With shared microexponents, a little shifting goes a long way
This paper introduces Block Data Representations (BDR), a framework for exploring and
evaluating a wide spectrum of narrow-precision formats for deep learning. It enables …
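The block-scaling idea behind these shared-exponent formats can be sketched in a few lines. Note this is a deliberately simplified single-level version: actual BDR/MX-style formats add shared microexponents at a finer (sub-block) granularity on top of the coarse block scale, and the function and parameter names here are illustrative, not from any spec:

```python
import numpy as np

def mx_quantize(block, elem_bits=8):
    """One-level block quantization sketch: every element in `block`
    shares a single power-of-two scale chosen from the block's largest
    magnitude; each element is then stored as a narrow signed integer."""
    block = np.asarray(block, dtype=np.float64)
    max_mag = np.max(np.abs(block))
    if max_mag == 0.0:
        return np.zeros(block.shape, dtype=np.int32), 1.0
    qmax = 2 ** (elem_bits - 1) - 1
    # Smallest power-of-two scale such that max_mag / scale <= qmax.
    shared_exp = int(np.ceil(np.log2(max_mag / qmax)))
    scale = 2.0 ** shared_exp
    q = np.clip(np.round(block / scale), -qmax - 1, qmax).astype(np.int32)
    return q, scale

def mx_dequantize(q, scale):
    return q.astype(np.float64) * scale
```

Because the scale is a power of two, "rescaling" in hardware is just an exponent shift, which is what makes these formats cheap to implement.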
Computers Can Learn from the Heuristic Designs and Master Internet Congestion Control
In this work, for the first time, we demonstrate that computers can automatically learn from
observing the heuristic efforts of the last four decades, stand on the shoulders of the existing …
A 95.6-TOPS/W deep learning inference accelerator with per-vector scaled 4-bit quantization in 5 nm
The energy efficiency of deep neural network (DNN) inference can be improved with custom
accelerators. DNN inference accelerators often employ specialized hardware techniques to …
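Per-vector scaling, as named in this entry's title, assigns each short contiguous vector its own scale factor so that local dynamic range is tracked much more tightly than with one scale per tensor or channel. A minimal sketch follows; the real accelerator uses a two-level scheme (narrow per-vector scales refined by a coarser floating-point scale), which is collapsed here into plain floating-point scales for clarity, and all names are illustrative:

```python
import numpy as np

def per_vector_quantize(x, vec_len=16, bits=4):
    """Quantize each contiguous vector of `vec_len` elements with its
    own scale, derived from that vector's max magnitude."""
    x = np.asarray(x, dtype=np.float64).reshape(-1, vec_len)
    qmax = 2 ** (bits - 1) - 1            # 7 for signed 4-bit
    scales = np.max(np.abs(x), axis=1, keepdims=True) / qmax
    scales[scales == 0] = 1.0             # avoid divide-by-zero on all-zero vectors
    q = np.clip(np.round(x / scales), -qmax - 1, qmax).astype(np.int8)
    return q, scales

def per_vector_dequantize(q, scales):
    return q.astype(np.float64) * scales
```

The per-element rounding error is bounded by half of the local scale, which is why fine-grained scaling preserves accuracy at very low bit widths.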
Daq: Channel-wise distribution-aware quantization for deep image super-resolution networks
Since the resurgence of deep neural networks (DNNs), image super-resolution (SR) has
recently seen huge progress in improving the quality of low-resolution images, however at …
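The channel-wise idea in this entry's title, giving each channel its own quantization parameters so channels with very different distributions do not share one coarse range, can be sketched with a basic per-channel affine quantizer. This is a generic illustration, not the DAQ method itself, and the names are invented:

```python
import numpy as np

def channelwise_quantize(w, bits=8):
    """Per-channel (axis 0) asymmetric quantization sketch: each row
    gets its own scale and zero-point from its own min/max range."""
    w = np.asarray(w, dtype=np.float64)
    levels = 2 ** bits - 1
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / levels
    scale[scale == 0] = 1.0               # constant channels: any scale works
    zero = np.round(-w_min / scale)       # unsigned code that maps back to 0.0
    q = np.clip(np.round(w / scale) + zero, 0, levels).astype(np.int32)
    return q, scale, zero

def channelwise_dequantize(q, scale, zero):
    return (q.astype(np.float64) - zero) * scale
```

With a single tensor-wide range, a wide-range channel would force a coarse step on every narrow-range channel; per-channel parameters avoid that.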
PikeLPN: Mitigating Overlooked Inefficiencies of Low-Precision Neural Networks
M Neseem, C McCullough, R Hsin… - Proceedings of the …, 2024 - openaccess.thecvf.com
Low-precision quantization is recognized for its efficacy in neural network optimization. Our
analysis reveals that non-quantized elementwise operations, which are prevalent in layers …
Pareto-optimal quantized resnet is mostly 4-bit
Quantization has become a popular technique to compress neural networks and reduce
compute cost, but most prior work focuses on studying quantization without changing the …
Model compression and efficient inference for large language models: A survey
Transformer-based large language models have achieved tremendous success. However,
the significant memory and computational costs incurred during the inference process make …