Machine learning at the network edge: A survey
Resource-constrained IoT devices, such as sensors and actuators, have become ubiquitous
in recent years. This has led to the generation of large quantities of data in real time, which …
A survey of techniques for optimizing transformer inference
Recent years have seen a phenomenal rise in the performance and applications of
transformer neural networks. The family of transformer networks, including Bidirectional …
Spatten: Efficient sparse attention architecture with cascade token and head pruning
The attention mechanism is becoming increasingly popular in Natural Language Processing
(NLP) applications, showing superior performance compared to convolutional and recurrent …
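As a rough illustration of the token-pruning idea named in the title, the sketch below ranks key/value tokens by the cumulative attention they receive and keeps only the most-attended ones. It is a minimal, generic example; the function and parameter names (prune_tokens, keep_ratio) are hypothetical and do not reflect SpAtten's actual hardware pipeline or cascade schedule.

```python
# Hedged sketch: token pruning driven by cumulative attention importance.
# Names (prune_tokens, keep_ratio) are assumptions for illustration only.
import numpy as np

def prune_tokens(attn_probs: np.ndarray, keep_ratio: float = 0.5) -> np.ndarray:
    """attn_probs: (heads, seq_len, seq_len) softmax attention of one layer.
    Returns sorted indices of tokens to keep, ranked by attention received."""
    # Importance of each key/value token = attention it receives,
    # summed over all heads and all query positions.
    importance = attn_probs.sum(axis=(0, 1))          # shape: (seq_len,)
    k = max(1, int(keep_ratio * importance.shape[0]))
    keep = np.argsort(importance)[-k:]                # top-k most attended tokens
    return np.sort(keep)

# Toy usage: 4 heads, 8 tokens, rows normalized as a stand-in for softmax.
rng = np.random.default_rng(0)
scores = rng.random((4, 8, 8))
probs = scores / scores.sum(axis=-1, keepdims=True)
print(prune_tokens(probs, keep_ratio=0.5))
```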
Hawq-v3: Dyadic neural network quantization
Current low-precision quantization algorithms often have the hidden cost of conversion back
and forth from floating point to quantized integer values. This hidden cost limits the latency …
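The snippet's point about conversion cost can be made concrete with a small sketch: simulated ("fake") quantization round-trips every tensor through floating point, whereas a dyadic rescale (integer multiply plus bit shift) stays in integer arithmetic. This is a generic illustration under assumed names (fake_quant, dyadic_rescale), not HAWQ-V3's actual implementation.

```python
# Minimal sketch: the float <-> int round trip of fake quantization vs. a
# dyadic rescale by b / 2**c using only integer multiply and shift.
# Function names are hypothetical illustrations.
import numpy as np

def fake_quant(x: np.ndarray, scale: float, bits: int = 8) -> np.ndarray:
    qmax = 2 ** (bits - 1) - 1
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)  # float -> int
    return q * scale                                   # int -> float again (the hidden cost)

def dyadic_rescale(acc: np.ndarray, b: int, c: int) -> np.ndarray:
    """Rescale an int32 accumulator by the dyadic number b / 2**c
    without ever touching floating point."""
    return (acc.astype(np.int64) * b) >> c

x = np.array([0.31, -1.2, 0.05])
print(fake_quant(x, scale=0.02))
print(dyadic_rescale(np.array([1000, -2400], dtype=np.int32), b=77, c=10))  # ~= * 0.0752
```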
Olive: Accelerating large language models via hardware-friendly outlier-victim pair quantization
Transformer-based large language models (LLMs) have achieved great success with the
growing model size. LLMs' size grows by 240× every two years, which outpaces the …
Fact: Ffn-attention co-optimized transformer architecture with eager correlation prediction
The Transformer model is becoming prevalent in various AI applications with its outstanding
performance. However, the high cost of computation and memory footprint make its …
Ant: Exploiting adaptive numerical data type for low-bit deep neural network quantization
Quantization is a technique to reduce the computation and memory cost of DNN models,
which are getting increasingly large. Existing quantization solutions use fixed-point integer …
Optimizing Selective Protection for CNN Resilience
As CNNs are being extensively employed in high-performance and safety-critical
applications that demand high reliability, it is important to ensure that they are resilient to …
Unleashing the Potential of Spiking Neural Networks with Dynamic Confidence
This paper presents a new methodology to alleviate the fundamental trade-off between
accuracy and latency in spiking neural networks (SNNs). The approach involves decoding …
Energon: Toward efficient acceleration of transformers using dynamic sparse attention
In recent years, transformer models have revolutionized natural language processing (NLP)
and shown promising performance on computer vision (CV) tasks. Despite their …
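To illustrate the general idea behind dynamic sparse attention, the sketch below keeps, for each query, only the top-k keys by score and applies softmax over that subset. It is a minimal example of the concept, not Energon's mix-precision filtering pipeline; the function name and top_k parameter are assumptions.

```python
# Hedged sketch: per-query top-k sparse attention. Names are assumed for illustration.
import numpy as np

def sparse_attention(q: np.ndarray, k: np.ndarray, v: np.ndarray, top_k: int) -> np.ndarray:
    """q, k, v: (seq_len, d). Returns (seq_len, d) attention output."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                     # (seq, seq) raw attention scores
    # Dynamically mask all but the top_k scores in each row.
    kth = np.partition(scores, -top_k, axis=-1)[:, -top_k][:, None]
    masked = np.where(scores >= kth, scores, -np.inf)
    weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over the surviving keys
    return weights @ v

rng = np.random.default_rng(0)
q = rng.standard_normal((6, 16)); k = rng.standard_normal((6, 16)); v = rng.standard_normal((6, 16))
print(sparse_attention(q, k, v, top_k=2).shape)       # (6, 16)
```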