A survey of techniques for optimizing transformer inference

KT Chitty-Venkata, S Mittal, M Emani… - Journal of Systems …, 2023 - Elsevier
Recent years have seen a phenomenal rise in the performance and applications of
transformer neural networks. The family of transformer networks, including Bidirectional …

A comprehensive review of binary neural network

C Yuan, SS Agaian - Artificial Intelligence Review, 2023 - Springer
Deep learning (DL) has recently changed the development of intelligent systems and is
widely adopted in many real-life applications. Despite their various benefits and potentials …

LLM-QAT: Data-free quantization aware training for large language models

Z Liu, B Oguz, C Zhao, E Chang, P Stock… - arXiv preprint arXiv …, 2023 - arxiv.org
Several post-training quantization methods have been applied to large language models
(LLMs), and have been shown to perform well down to 8 bits. We find that these methods …
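
The snippet above mentions post-training quantization performing well down to 8 bits. As a point of reference only, the following is a minimal NumPy sketch of generic symmetric per-tensor int8 post-training quantization; it is not LLM-QAT's data-free quantization-aware training, and the function names and per-tensor granularity are illustrative assumptions.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor 8-bit post-training quantization (illustrative sketch)."""
    # map the largest-magnitude weight onto the int8 range [-127, 127]
    scale = np.max(np.abs(w)) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate float32 weight matrix from int8 values and the scale."""
    return q.astype(np.float32) * scale

# usage: the round-trip error stays small relative to the weight magnitudes
w = np.random.randn(256, 256).astype(np.float32)
q, s = quantize_int8(w)
err = np.abs(dequantize(q, s) - w).max()
```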

PB-LLM: Partially binarized large language models

Y Shang, Z Yuan, Q Wu, Z Dong - arXiv preprint arXiv:2310.00034, 2023 - arxiv.org
This paper explores network binarization, a radical form of quantization that compresses model
weights to a single bit, specifically for compressing Large Language Models (LLMs). Due to …
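
Since several entries in this listing concern network binarization, a brief illustration may help: binarization replaces full-precision weights with values in {-1, +1} plus a scaling factor. The NumPy sketch below shows generic sign-based weight binarization with a per-row scale (the mean absolute value, in the style of XNOR-Net-like schemes); it is not PB-LLM's partially binarized method, and the shapes and names are illustrative assumptions.

```python
import numpy as np

def binarize_weights(w: np.ndarray):
    """Generic sign-based weight binarization (illustrative sketch).

    w: 2-D weight matrix of shape (out_features, in_features).
    Returns binary weights in {-1, +1} and a per-row scale alpha that
    minimizes the L2 error ||w - alpha * sign(w)||.
    """
    # sign(0) is mapped to +1 so every weight is representable in 1 bit
    w_bin = np.where(w >= 0, 1.0, -1.0)
    # the optimal per-output-channel scale is the mean absolute value of the row
    alpha = np.mean(np.abs(w), axis=1, keepdims=True)
    return w_bin, alpha

# usage: approximate a float matmul with 1-bit weights and float activations
w = np.random.randn(4, 8).astype(np.float32)
x = np.random.randn(8).astype(np.float32)
w_bin, alpha = binarize_weights(w)
y_full = w @ x
y_bin = (alpha * w_bin) @ x
```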

BiBench: Benchmarking and analyzing network binarization

H Qin, M Zhang, Y Ding, A Li, Z Cai… - International …, 2023 - proceedings.mlr.press
Network binarization emerges as one of the most promising compression approaches
offering extraordinary computation and memory savings by minimizing the bit-width …

BiViT: Extremely compressed binary vision transformers

Y He, Z Lou, L Zhang, J Liu, W Wu… - Proceedings of the …, 2023 - openaccess.thecvf.com
Model binarization can significantly compress model size, reduce energy
consumption, and accelerate inference through efficient bit-wise operations. Although …
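
The "efficient bit-wise operations" mentioned above typically refer to replacing multiply-accumulate with XNOR and popcount once both operands are binary. A minimal sketch follows, using plain Python integers as bit containers; it illustrates the general idea only, not any particular paper's kernel.

```python
import numpy as np

def pack_signs(v: np.ndarray) -> int:
    """Pack a {-1, +1} vector into the bits of a Python int (1 for +1, 0 for -1)."""
    bits = 0
    for i, s in enumerate(v):
        if s > 0:
            bits |= 1 << i
    return bits

def xnor_popcount_dot(a_bits: int, b_bits: int, n: int) -> int:
    """Dot product of two {-1, +1} vectors of length n via XNOR + popcount.

    Matching bits contribute +1 and mismatching bits contribute -1,
    so dot = n - 2 * popcount(a XOR b).
    """
    mismatches = bin((a_bits ^ b_bits) & ((1 << n) - 1)).count("1")
    return n - 2 * mismatches

# usage: agrees with the ordinary dot product on sign vectors
a = np.random.choice([-1, 1], size=16)
b = np.random.choice([-1, 1], size=16)
assert xnor_popcount_dot(pack_signs(a), pack_signs(b), 16) == int(a @ b)
```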

BinaryViT: Pushing binary vision transformers towards convolutional models

PHC Le, X Li - Proceedings of the IEEE/CVF Conference on …, 2023 - openaccess.thecvf.com
With the increasing popularity and growing size of vision transformers (ViTs), there
has been rising interest in making them more efficient and less computationally …

Scalable MatMul-free language modeling

RJ Zhu, Y Zhang, E Sifferman, T Sheaves… - arXiv preprint arXiv …, 2024 - openreview.net
Matrix multiplication (MatMul) typically dominates the overall computational cost of large
language models (LLMs). This cost only grows as LLMs scale to larger embedding …

DB-LLM: Accurate dual-binarization for efficient LLMs

H Chen, C Lv, L Ding, H Qin, X Zhou, Y Ding… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) have significantly advanced the field of natural language
processing, while their expensive memory and computation consumption impedes their …

ShiftAddViT: Mixture of multiplication primitives towards efficient vision transformer

H You, H Shi, Y Guo, Y Lin - Advances in Neural …, 2023 - proceedings.neurips.cc
Vision Transformers (ViTs) have shown impressive performance and have become
a unified backbone for multiple vision tasks. However, both the attention mechanism and …