A survey of techniques for optimizing transformer inference

KT Chitty-Venkata, S Mittal, M Emani… - Journal of Systems …, 2023 - Elsevier
Recent years have seen a phenomenal rise in the performance and applications of
transformer neural networks. The family of transformer networks, including Bidirectional …

Dynamic neural network structure: A review for its theories and applications

J Guo, CLP Chen, Z Liu, X Yang - IEEE Transactions on Neural …, 2024 - ieeexplore.ieee.org
The dynamic neural network (DNN), in contrast to the static counterpart, offers numerous
advantages, such as improved accuracy, efficiency, and interpretability. These benefits stem …

DeepMAD: Mathematical architecture design for deep convolutional neural network

X Shen, Y Wang, M Lin, Y Huang… - Proceedings of the …, 2023 - openaccess.thecvf.com
The rapid advances in Vision Transformers (ViTs) have refreshed state-of-the-art performance in
various vision tasks, overshadowing the conventional CNN-based models. This ignites a few …

PackQViT: Faster sub-8-bit vision transformers via full and packed quantization on the mobile

P Dong, L Lu, C Wu, C Lyu, G Yuan… - Advances in Neural …, 2023 - proceedings.neurips.cc
While Vision Transformers (ViTs) have undoubtedly made impressive strides in
computer vision (CV), their intricate network structures necessitate substantial computation …
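
The snippet only names the technique, but "packed" sub-8-bit quantization generally means storing more than one low-bit value per byte. Below is a minimal, illustrative sketch (not PackQViT's actual scheme) of symmetric 4-bit weight quantization with two values packed per byte; the function names and the single per-tensor scale are assumptions made for the example.

```python
import numpy as np

def quantize_int4(x, scale):
    # Symmetric 4-bit quantization: integer values land in [-8, 7].
    return np.clip(np.round(x / scale), -8, 7).astype(np.int8)

def pack_int4_pairs(q):
    # Pack two signed 4-bit values into one byte (low and high nibble).
    # Assumes an even number of elements.
    lo = (q[0::2] & 0x0F).astype(np.uint8)
    hi = (q[1::2] & 0x0F).astype(np.uint8)
    return (hi << 4) | lo

def unpack_int4_pairs(packed):
    # Recover signed 4-bit values from each nibble (with sign extension).
    lo = (packed & 0x0F).astype(np.int16)
    hi = ((packed >> 4) & 0x0F).astype(np.int16)
    lo = np.where(lo > 7, lo - 16, lo)
    hi = np.where(hi > 7, hi - 16, hi)
    out = np.empty(packed.size * 2, dtype=np.int16)
    out[0::2], out[1::2] = lo, hi
    return out

# Toy round trip: 4-bit packed weights use half the storage of int8.
w = np.random.randn(16).astype(np.float32)
scale = np.abs(w).max() / 7
packed = pack_int4_pairs(quantize_int4(w, scale))   # 8 bytes for 16 weights
w_hat = unpack_int4_pairs(packed).astype(np.float32) * scale
```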

Zero-TPrune: Zero-shot token pruning through leveraging of the attention graph in pre-trained transformers

H Wang, B Dedhia, NK Jha - Proceedings of the IEEE/CVF …, 2024 - openaccess.thecvf.com
Deployment of Transformer models on edge devices is becoming increasingly challenging
due to the exponentially growing inference cost that scales quadratically with the number of …
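
The snippet stops mid-sentence, but the title points at zero-shot (no-retraining) token pruning driven by attention. As a rough, generic illustration only (not the paper's attention-graph algorithm), the sketch below ranks tokens by the average attention they receive in one head and keeps the top-k; all names and shapes here are assumptions for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def prune_tokens_by_attention(q, k, tokens, keep):
    # q, k: (T, d) query/key projections of one attention head.
    # tokens: (T, d_model) hidden states to prune; keep: number of tokens to retain.
    d = q.shape[-1]
    attn = softmax(q @ k.T / np.sqrt(d), axis=-1)        # (T, T) attention matrix
    importance = attn.mean(axis=0)                       # mean attention each token receives
    keep_idx = np.sort(np.argsort(importance)[-keep:])   # top-k tokens, original order
    return tokens[keep_idx], keep_idx

# Toy usage: prune a 16-token sequence down to 8 tokens without any retraining.
rng = np.random.default_rng(0)
T, d, d_model = 16, 32, 64
q, k = rng.normal(size=(T, d)), rng.normal(size=(T, d))
tokens = rng.normal(size=(T, d_model))
pruned, kept_idx = prune_tokens_by_attention(q, k, tokens, keep=8)
```

Dropping tokens this way shrinks the quadratic attention cost in every later layer, which is why token pruning targets exactly the scaling problem the abstract mentions.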

Agile-Quant: Activation-guided quantization for faster inference of LLMs on the edge

X Shen, P Dong, L Lu, Z Kong, Z Li, M Lin… - Proceedings of the …, 2024 - ojs.aaai.org
Large Language Models (LLMs) stand out for their impressive performance in intricate
language modeling tasks. However, their demanding computational and memory needs …
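
"Activation-guided" quantization is not spelled out in the snippet; a common generic idea it gestures at is using calibration activations to decide how aggressively each input channel's weights can be quantized. The sketch below folds per-channel activation magnitudes into the weights before symmetric int8 quantization; it is an assumption-laden illustration, not Agile-Quant's actual algorithm, and all names are placeholders.

```python
import numpy as np

def activation_guided_int8(w, act_samples, alpha=0.5):
    # w: (out_features, in_features) weight matrix.
    # act_samples: (n_samples, in_features) calibration activations.
    # Scale each input channel by its observed activation magnitude so that
    # channels feeding large activations keep more weight precision.
    a_scale = np.abs(act_samples).max(axis=0) ** alpha + 1e-8   # per input channel
    w_scaled = w * a_scale                                      # fold scales into weights
    q_scale = np.abs(w_scaled).max() / 127
    w_q = np.clip(np.round(w_scaled / q_scale), -127, 127).astype(np.int8)
    return w_q, q_scale, a_scale

def dequant_matmul(x, w_q, q_scale, a_scale):
    # Inference-time use: divide activations by the folded channel scales,
    # then apply the quantized weights; (x / a_scale) @ (w * a_scale).T == x @ w.T.
    return (x / a_scale) @ (w_q.astype(np.float32).T * q_scale)

# Toy usage with random calibration data.
rng = np.random.default_rng(0)
w = rng.normal(size=(8, 16)).astype(np.float32)
acts = rng.normal(size=(128, 16)).astype(np.float32)
w_q, q_scale, a_scale = activation_guided_int8(w, acts)
y_approx = dequant_matmul(acts[:4], w_q, q_scale, a_scale)   # approximates acts[:4] @ w.T
```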

SSR: Spatial sequential hybrid architecture for latency throughput tradeoff in transformer acceleration

J Zhuang, Z Yang, S Ji, H Huang, AK Jones… - Proceedings of the …, 2024 - dl.acm.org
As the computational intensity of chips increases, the mismatch between
computation layer shapes and the available computation resources significantly limits the …

An integer-only and group-vector systolic accelerator for efficiently mapping vision transformer on edge

M Huang, J Luo, C Ding, Z Wei… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Transformer-like networks have shown remarkably high performance in both natural language
processing and computer vision. However, the huge computational demands in non-linear …

Lightening-Transformer: A dynamically-operated optically-interconnected photonic transformer accelerator

H Zhu, J Gu, H Wang, Z Jiang, Z Zhang… - … Symposium on High …, 2024 - ieeexplore.ieee.org
The wide adoption and significant computing resource cost of attention-based transformers,
e.g., Vision Transformers and large language models, have driven the demand for efficient …

HARDSEA: Hybrid analog-ReRAM clustering and digital-SRAM in-memory computing accelerator for dynamic sparse self-attention in transformer

S Liu, C Mu, H Jiang, Y Wang, J Zhang… - … Transactions on Very …, 2023 - ieeexplore.ieee.org
Self-attention-based transformers have outperformed recurrent and convolutional neural
networks (RNNs/CNNs) in many applications. Despite their effectiveness, calculating self …