Machine learning at the network edge: A survey

MGS Murshed, C Murphy, D Hou, N Khan… - ACM Computing …, 2021 - dl.acm.org
Resource-constrained IoT devices, such as sensors and actuators, have become ubiquitous
in recent years. This has led to the generation of large quantities of data in real-time, which …

A survey of techniques for optimizing transformer inference

KT Chitty-Venkata, S Mittal, M Emani… - Journal of Systems …, 2023 - Elsevier
Recent years have seen a phenomenal rise in the performance and applications of
transformer neural networks. The family of transformer networks, including Bidirectional …

SpAtten: Efficient sparse attention architecture with cascade token and head pruning

H Wang, Z Zhang, S Han - 2021 IEEE International Symposium …, 2021 - ieeexplore.ieee.org
The attention mechanism is becoming increasingly popular in Natural Language Processing
(NLP) applications, showing performance superior to convolutional and recurrent …
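
Since the snippet breaks off before the paper's mechanism, the sketch below only illustrates the general idea named in the title, cascade token pruning: tokens that accumulate little attention mass are dropped for subsequent layers. The function name, the keep ratio, and the per-layer granularity are illustrative assumptions, not SpAtten's exact algorithm or hardware pipeline.

import numpy as np

def prune_tokens_by_attention(attn_probs, keep_ratio=0.5):
    """Rank tokens by the attention mass they receive and keep the top fraction.

    attn_probs: [heads, query_len, key_len] softmax attention probabilities.
    Returns sorted indices of the key tokens to keep for later layers.
    """
    importance = attn_probs.sum(axis=(0, 1))              # cumulative attention per token
    k = max(1, int(round(keep_ratio * importance.shape[0])))
    return np.sort(np.argsort(importance)[-k:])           # top-k tokens, original order

# toy usage: 4 heads, 10 tokens; later layers attend only over the kept tokens
attn = np.random.dirichlet(np.ones(10), size=(4, 10))     # each attention row sums to 1
print(prune_tokens_by_attention(attn, keep_ratio=0.6))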

HAWQ-V3: Dyadic neural network quantization

Z Yao, Z Dong, Z Zheng, A Gholami… - International …, 2021 - proceedings.mlr.press
Current low-precision quantization algorithms often have the hidden cost of conversion back
and forth from floating point to quantized integer values. This hidden cost limits the latency …
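
To make the "hidden cost" concrete: integer-only (dyadic) requantization replaces the floating-point rescaling step between layers with an integer multiply and a bit shift, so activations never leave the integer domain. The sketch below is a minimal illustration under assumed names and a 31-bit multiplier width; it is not HAWQ-V3's implementation.

import numpy as np

def dyadic_approx(scale, mult_bits=31):
    """Approximate a real rescaling factor 0 < scale < 1 as b / 2**c (a dyadic number)."""
    b = int(round(scale * (1 << mult_bits)))
    return b, mult_bits

def requantize_int(acc_int32, scale):
    """Rescale int32 accumulators to int8 using only an integer multiply and a shift."""
    b, c = dyadic_approx(scale)
    out = (acc_int32.astype(np.int64) * b) >> c            # integer-only rescale
    return np.clip(out, -128, 127).astype(np.int8)         # saturate to the int8 range

# toy usage: an int8 GEMM accumulates in int32, then requantizes with no float math
acc = np.random.randint(-2**20, 2**20, size=(4, 4), dtype=np.int32)
print(requantize_int(acc, scale=1.5e-4))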

OliVe: Accelerating large language models via hardware-friendly outlier-victim pair quantization

C Guo, J Tang, W Hu, J Leng, C Zhang… - Proceedings of the 50th …, 2023 - dl.acm.org
Transformer-based large language models (LLMs) have achieved great success alongside their
growing model size. LLM size grows by 240× every two years, which outpaces the …

FACT: FFN-attention co-optimized transformer architecture with eager correlation prediction

Y Qin, Y Wang, D Deng, Z Zhao, X Yang, L Liu… - Proceedings of the 50th …, 2023 - dl.acm.org
The Transformer model is becoming prevalent in various AI applications thanks to its outstanding
performance. However, its high computation cost and memory footprint make its …

ANT: Exploiting adaptive numerical data type for low-bit deep neural network quantization

C Guo, C Zhang, J Leng, Z Liu, F Yang… - 2022 55th IEEE/ACM …, 2022 - ieeexplore.ieee.org
Quantization is a technique to reduce the computation and memory cost of DNN models,
which are getting increasingly large. Existing quantization solutions use fixed-point integer …
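
For reference, the fixed-point baseline the snippet alludes to is the standard symmetric uniform quantizer sketched below, which maps each weight to scale × integer. The per-tensor scale and the 4-bit width are illustrative choices; this is the baseline that an adaptive data type such as ANT's is designed to improve on, not ANT itself.

import numpy as np

def uniform_quantize(w, num_bits=4):
    """Symmetric fixed-point quantization: w ≈ scale * q with q a signed integer."""
    qmax = 2 ** (num_bits - 1) - 1                         # e.g. 7 for 4-bit signed
    scale = np.abs(w).max() / qmax                         # one scale for the whole tensor
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

# toy usage: outliers inflate the scale, wasting resolution on typical values
w = np.random.randn(8).astype(np.float32)
q, s = uniform_quantize(w, num_bits=4)
print(w)
print(q * s)                                               # dequantized approximation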

Optimizing Selective Protection for CNN Resilience

A Mahmoud, SKS Hari, CW Fletcher, SV Adve, C Sakr… - ISSRE, 2021 - ma3mool.github.io
As CNNs are being extensively employed in high-performance and safety-critical
applications that demand high reliability, it is important to ensure that they are resilient to …

Unleashing the Potential of Spiking Neural Networks with Dynamic Confidence

C Li, EG Jones, S Furber - Proceedings of the IEEE/CVF …, 2023 - openaccess.thecvf.com
This paper presents a new methodology to alleviate the fundamental trade-off between
accuracy and latency in spiking neural networks (SNNs). The approach involves decoding …
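
The snippet is cut off before the decoding step, so the sketch below only illustrates the accuracy/latency mechanism it hints at: accumulate output spikes over time steps and terminate inference as soon as a confidence measure crosses a threshold, so easy inputs use fewer time steps. The Poisson spike model, the softmax confidence proxy, and the 0.9 threshold are illustrative assumptions rather than the paper's confidence metric.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def decode_with_dynamic_confidence(spikes_per_step, threshold=0.9):
    """Accumulate output spike counts over time; stop once the prediction is confident.

    spikes_per_step: [T, num_classes] output spike counts per time step.
    Returns (predicted_class, time_steps_used).
    """
    counts = np.zeros(spikes_per_step.shape[1])
    for t, s in enumerate(spikes_per_step, start=1):
        counts += s
        if softmax(counts).max() >= threshold:             # confident enough: exit early
            break
    return int(np.argmax(counts)), t

# toy usage: 8 time steps, 10 classes, class 3 fires most often
rng = np.random.default_rng(0)
spikes = rng.poisson(lam=np.where(np.arange(10) == 3, 2.0, 0.3), size=(8, 10))
print(decode_with_dynamic_confidence(spikes, threshold=0.9))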

Energon: Toward efficient acceleration of transformers using dynamic sparse attention

Z Zhou, J Liu, Z Gu, G Sun - IEEE Transactions on Computer …, 2022 - ieeexplore.ieee.org
In recent years, transformer models have revolutionized natural language processing (NLP)
and shown promising performance on computer vision (CV) tasks. Despite their …
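
As a rough picture of what dynamic sparse attention means, the sketch below scores every key for a query but runs softmax only over the top-k, so the expensive value aggregation touches a small, input-dependent subset of positions. Energon's actual design uses multi-round, low-precision filtering in hardware; the single full-precision scoring pass and the parameter names here are simplifying assumptions.

import numpy as np

def topk_sparse_attention(q, k, v, keep=4):
    """Per-query dynamic sparsity: score all keys, then attend only to the top-k.

    q: [d], k: [n, d], v: [n, d_v]; returns the attended output for one query.
    """
    scores = k @ q / np.sqrt(q.shape[0])                   # full scores (a real design would
    idx = np.argpartition(scores, -keep)[-keep:]           # filter with a cheap low-precision pass)
    w = np.exp(scores[idx] - scores[idx].max())
    w /= w.sum()                                           # softmax over the kept keys only
    return w @ v[idx]

# toy usage: 16 keys/values of dimension 8, keep the 4 most relevant per query
rng = np.random.default_rng(1)
q, k, v = rng.normal(size=8), rng.normal(size=(16, 8)), rng.normal(size=(16, 8))
print(topk_sparse_attention(q, k, v, keep=4))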