Model quantization and hardware acceleration for vision transformers: A comprehensive survey

D Du, G Gong, X Chu - arXiv preprint arXiv:2405.00314, 2024 - arxiv.org
Vision Transformers (ViTs) have recently garnered considerable attention, emerging as a
promising alternative to convolutional neural networks (CNNs) in several vision-related …
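
As a concrete illustration of the kind of technique such a survey covers, here is a minimal sketch of uniform symmetric post-training weight quantization; the function names, bit width, and random stand-in weights are illustrative assumptions, not details taken from the paper:

import numpy as np

def quantize_uniform(w, num_bits=8):
    """Uniformly quantize a weight tensor to signed integers; return the
    quantized values and the scale needed to dequantize them."""
    qmax = 2 ** (num_bits - 1) - 1              # 127 for int8
    scale = np.max(np.abs(w)) / qmax            # per-tensor symmetric scale
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)    # stand-in for a ViT linear layer
q, s = quantize_uniform(w)
print(np.abs(w - dequantize(q, s)).max())       # worst-case quantization error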

A survey of FPGA and ASIC designs for transformer inference acceleration and optimization

BJ Kang, HI Lee, SK Yoon, YC Kim, SB Jeong… - Journal of Systems …, 2024 - Elsevier
Recently, transformer-based models have achieved remarkable success in various fields,
such as computer vision, speech recognition, and natural language processing. However …

Kangaroo: Lossless self-speculative decoding via double early exiting

F Liu, Y Tang, Z Liu, Y Ni, K Han, Y Wang - arXiv preprint arXiv …, 2024 - arxiv.org
Speculative decoding has demonstrated its effectiveness in accelerating the inference of
large language models while maintaining a consistent sampling distribution. However, the …
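
The generic draft-and-verify loop behind speculative decoding can be sketched in a few lines; this greedy toy version assumes abstract draft_step/target_step callables and does not reproduce Kangaroo's double early-exiting mechanism:

def speculative_decode(target_step, draft_step, prompt, k=4, max_len=32):
    """Greedy draft-and-verify: a cheap draft model proposes k tokens,
    the target model keeps the longest prefix it agrees with."""
    tokens = list(prompt)
    while len(tokens) < max_len:
        draft = []
        for _ in range(k):                       # cheap model proposes k tokens
            draft.append(draft_step(tokens + draft))
        for i in range(k):                       # target verifies the proposals
            t = target_step(tokens + draft[:i])
            if t != draft[i]:
                tokens.extend(draft[:i] + [t])   # keep agreed prefix + correction
                break
        else:
            tokens.extend(draft)                 # every proposal was accepted
    return tokens[:max_len]

toy = lambda seq: (seq[-1] + 1) % 10             # trivial "model" for a smoke test
print(speculative_decode(toy, toy, [0], k=4, max_len=12))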

On energy complexity of fully-connected layers

J Šíma, J Cabessa, P Vidnerová - Neural Networks, 2024 - Elsevier
The massive increase in the size of deep neural networks (DNNs) is accompanied by a
significant increase in the energy consumption of their hardware implementations, which is …
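
A back-of-the-envelope version of the quantity studied here is easy to write down: the energy of a fully-connected layer is dominated by its multiply-accumulate and memory-access counts. The per-operation energy constants below are generic placeholder figures, not values from the paper:

def fc_layer_energy(n_in, n_out, e_mac=4.6e-12, e_read=2.5e-12):
    """Rough energy estimate (joules) for one dense layer: one MAC per
    weight plus one memory read per weight and per activation."""
    macs = n_in * n_out
    reads = n_in * n_out + n_in + n_out          # weights, inputs, outputs
    return macs * e_mac + reads * e_read

print(fc_layer_energy(768, 3072))                # e.g. a transformer MLP block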

A Survey on Large Language Model Acceleration based on KV Cache Management

H Li, Y Li, A Tian, T Tang, Z Xu, X Chen, N Hu… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) have revolutionized a wide range of domains such as
natural language processing, computer vision, and multi-modal tasks due to their ability to …
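
At its core, KV cache management revolves around storing each step's keys and values so the prefix is not re-encoded at every decoding step. A minimal single-head sketch follows; the shapes and the NumPy implementation are illustrative assumptions:

import numpy as np

def attend(q, K, V):
    """Scaled dot-product attention for one query over cached keys/values."""
    scores = K @ q / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

d = 16
K_cache = np.empty((0, d))
V_cache = np.empty((0, d))
for step in range(5):                            # autoregressive decoding loop
    q, k, v = (np.random.randn(d) for _ in range(3))
    K_cache = np.vstack([K_cache, k])            # append this step's key/value
    V_cache = np.vstack([V_cache, v])            # instead of recomputing the prefix
    out = attend(q, K_cache, V_cache)
print(K_cache.shape, out.shape)                  # (5, 16) (16,)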

Analysis and Behavioral Modeling Using Augmented Transformer for Satellite Communication Power Amplifiers

G Zhao, K Ying, Q Wen, L Zhao, J Pang… - IEEE Internet of …, 2024 - ieeexplore.ieee.org
To meet the demand for high-speed, high-quality communication in next-generation 6G satellite
communication, it is both necessary and urgent to study the behavioral modeling of 6G …

SLAB: Efficient Transformers with Simplified Linear Attention and Progressive Re-parameterized Batch Normalization

J Guo, X Chen, Y Tang, Y Wang - arXiv preprint arXiv:2405.11582, 2024 - arxiv.org
Transformers have become foundational architectures for both natural language and
computer vision tasks. However, the high computational cost makes it quite challenging to …
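
Linear attention, one of the two ingredients in the title, replaces the softmax with a kernel feature map so that attention cost grows linearly in sequence length. The sketch below shows only the generic kernelized form; SLAB's simplified variant and the progressive re-parameterized batch normalization are not reproduced, and the ReLU feature map is an assumption:

import numpy as np

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
    """softmax(QK^T)V approximated as phi(Q)(phi(K)^T V): the (d x d_v)
    summary kv is independent of sequence length."""
    Qp, Kp = phi(Q), phi(K)
    kv = Kp.T @ V                                # (d, d_v)
    z = Qp @ Kp.sum(axis=0)                      # per-query normalizer
    return (Qp @ kv) / z[:, None]

n, d = 128, 32
Q, K, V = (np.random.randn(n, d) for _ in range(3))
print(linear_attention(Q, K, V).shape)           # (128, 32)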

Towards Effective Data-Free Knowledge Distillation via Diverse Diffusion Augmentation

M Li, D Zhang, T He, X Xie, YF Li, K Qin - Proceedings of the 32nd ACM …, 2024 - dl.acm.org
Data-free knowledge distillation (DFKD) has emerged as a pivotal technique in the domain
of model compression, substantially reducing the dependency on the original training data …
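
The distillation objective at the heart of DFKD is the standard temperature-scaled teacher/student divergence, computed on synthesized rather than real inputs. The sketch below shows only that loss; the diffusion-based augmentation that generates the inputs is not reproduced, and the random logits are placeholders:

import numpy as np

def softmax(x, T=1.0):
    e = np.exp((x - x.max(axis=-1, keepdims=True)) / T)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(teacher_logits, student_logits, T=4.0):
    """Temperature-scaled KL divergence between teacher and student."""
    p, q = softmax(teacher_logits, T), softmax(student_logits, T)
    return (T * T) * np.sum(p * (np.log(p) - np.log(q)), axis=-1).mean()

t = np.random.randn(8, 10)                       # teacher logits on a synthetic batch
s = t + 0.1 * np.random.randn(8, 10)             # imperfect student logits
print(kd_loss(t, s))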

[PDF] Efficient model compression and knowledge distillation on Llama 2: Achieving high performance with reduced computational cost

Q Huangpu, H Gao - 2024 - files.osf.io
This study investigates the application of model compression and knowledge distillation
techniques to enhance the computational efficiency of Llama 2, a Large Language Model …

Hotfixing Large Language Models for Code

Z Yang, D Lo - arXiv preprint arXiv:2408.05727, 2024 - arxiv.org
Large Language Models for Code (LLM4Code) have become an integral part of developers'
workflows, assisting with tasks such as code completion and generation. However, these …