A survey of techniques for optimizing transformer inference

KT Chitty-Venkata, S Mittal, M Emani… - Journal of Systems …, 2023 - Elsevier
Recent years have seen a phenomenal rise in the performance and applications of
transformer neural networks. The family of transformer networks, including Bidirectional …

Weight-sharing neural architecture search: A battle to shrink the optimization gap

L Xie, X Chen, K Bi, L Wei, Y Xu, L Wang… - ACM Computing …, 2021 - dl.acm.org
Neural architecture search (NAS) has attracted increasing attention. In recent years,
individual search methods have been replaced by weight-sharing search methods for higher …
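For context on the weight-sharing idea named in the title, here is a minimal sketch (an illustration under assumed details, not any specific method covered by the survey): all candidate operations live in a single supernet, and every sampled sub-architecture trains and evaluates against the same shared parameters.

```python
# Illustrative weight-sharing NAS sketch: candidate ops share one set of
# supernet weights; each search step samples a random sub-architecture and
# updates only the chosen path, so all subnets reuse the same parameters.
import random
import torch
import torch.nn as nn

class MixedLayer(nn.Module):
    """One supernet layer whose candidate operations are shared by all subnets."""
    def __init__(self, dim):
        super().__init__()
        self.candidates = nn.ModuleDict({
            "linear": nn.Linear(dim, dim),
            "bottleneck": nn.Sequential(nn.Linear(dim, dim // 2), nn.ReLU(),
                                        nn.Linear(dim // 2, dim)),
            "identity": nn.Identity(),
        })

    def forward(self, x, choice):
        return self.candidates[choice](x)

class SuperNet(nn.Module):
    def __init__(self, dim=64, depth=4):
        super().__init__()
        self.layers = nn.ModuleList(MixedLayer(dim) for _ in range(depth))

    def forward(self, x, arch):
        for layer, choice in zip(self.layers, arch):
            x = layer(x, choice)
        return x

supernet = SuperNet()
ops = ["linear", "bottleneck", "identity"]
opt = torch.optim.SGD(supernet.parameters(), lr=0.01)

# Search loop sketch: sample an architecture, train that path on a toy batch.
for step in range(3):
    arch = [random.choice(ops) for _ in supernet.layers]
    x, target = torch.randn(8, 64), torch.randn(8, 64)
    loss = nn.functional.mse_loss(supernet(x, arch), target)
    opt.zero_grad()
    loss.backward()
    opt.step()
    print(step, arch, float(loss))
```

After such shared-weight training, candidate architectures can be ranked by evaluating them directly in the supernet instead of training each one from scratch, which is the efficiency gain the survey's title refers to.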

SqueezeLLM: Dense-and-sparse quantization

S Kim, C Hooper, A Gholami, Z Dong, X Li… - arXiv preprint arXiv …, 2023 - arxiv.org
Generative Large Language Models (LLMs) have demonstrated remarkable results for a
wide range of tasks. However, deploying these models for inference has been a significant …
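To illustrate the dense-and-sparse decomposition named in the title, the following is a rough sketch of the general idea only (not the exact SqueezeLLM algorithm, which additionally uses sensitivity-based non-uniform quantization): a small fraction of outlier weights is kept in full precision as a sparse tensor, while the dense remainder is quantized to a low bit-width.

```python
# Sketch of a dense-and-sparse weight decomposition (general idea only):
# the largest-magnitude "outlier" weights stay in full precision as a sparse
# matrix, and the remaining dense part is uniformly quantized to few bits.
import torch

def dense_and_sparse_quantize(w: torch.Tensor, bits: int = 4, outlier_frac: float = 0.005):
    # Select the largest-magnitude weights as outliers and store them sparsely.
    k = max(1, int(w.numel() * outlier_frac))
    threshold = w.abs().flatten().topk(k).values.min()
    outlier_mask = w.abs() >= threshold
    sparse_outliers = (w * outlier_mask).to_sparse()

    # Uniformly quantize the dense remainder (outlier positions zeroed out).
    dense = w * (~outlier_mask)
    qmax = 2 ** (bits - 1) - 1
    scale = dense.abs().max() / qmax
    q = torch.clamp(torch.round(dense / scale), -qmax - 1, qmax).to(torch.int8)
    return q, scale, sparse_outliers

def dequantize(q, scale, sparse_outliers):
    # Reconstruct by adding the full-precision outliers back onto the dense part.
    return q.float() * scale + sparse_outliers.to_dense()

w = torch.randn(256, 256)
q, scale, sparse = dense_and_sparse_quantize(w)
err = (dequantize(q, scale, sparse) - w).abs().mean()
print(f"mean abs reconstruction error: {err:.4f}")
```

Separating the outliers keeps the quantization range of the dense part small, which is why this style of decomposition loses less accuracy than quantizing all weights with one scale.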

A fast post-training pruning framework for transformers

W Kwon, S Kim, MW Mahoney… - Advances in …, 2022 - proceedings.neurips.cc
Pruning is an effective way to reduce the huge inference cost of Transformer models.
However, prior work on pruning Transformers requires retraining the models. This can add …
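A minimal sketch of the retraining-free flavour of structured pruning follows, assuming a simple weight-magnitude proxy for attention-head importance; the cited framework instead derives importance from a small calibration set and searches pruning masks, so treat this purely as an illustration of pruning without any retraining step.

```python
# Sketch of post-training (retraining-free) structured pruning of attention
# heads: score heads with a weight-magnitude proxy and zero out the weakest.
import torch
import torch.nn as nn

torch.manual_seed(0)
embed_dim, num_heads = 64, 8
head_dim = embed_dim // num_heads
mha = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

with torch.no_grad():
    # Proxy importance: L2 norm of each head's slice of the output projection.
    w_o = mha.out_proj.weight  # shape (embed_dim, embed_dim)
    scores = torch.stack([
        w_o[:, h * head_dim:(h + 1) * head_dim].norm() for h in range(num_heads)
    ])

    # Zero out the two lowest-scoring heads; no fine-tuning follows.
    pruned = scores.topk(2, largest=False).indices.tolist()
    for h in pruned:
        w_o[:, h * head_dim:(h + 1) * head_dim].zero_()

x = torch.randn(4, 10, embed_dim)
out, _ = mha(x, x, x)
print("pruned heads:", pruned, "output shape:", tuple(out.shape))
```

In a real deployment the zeroed columns (and the corresponding query/key/value rows) would be physically removed so the pruned heads stop consuming compute, rather than merely being masked.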

Speculative decoding with big little decoder

S Kim, K Mangalam, S Moon, J Malik… - Advances in …, 2024 - proceedings.neurips.cc
The recent emergence of Large Language Models based on the Transformer architecture
has enabled dramatic advancements in the field of Natural Language Processing. However …
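Below is a minimal sketch of the general draft-then-verify idea behind speculative decoding, using toy models and a simple greedy acceptance rule; the Big Little Decoder paper defines its own fallback and rollback policy between the small and large models, so this is illustrative only.

```python
# Sketch of speculative ("draft then verify") decoding: a cheap draft model
# proposes a short block of tokens, and the expensive target model verifies
# them, keeping the accepted prefix and correcting the first mismatch.
import torch
import torch.nn as nn

torch.manual_seed(0)
VOCAB = 100

class TinyLM(nn.Module):
    """Toy causal LM: embeds the prefix, mean-pools it, predicts the next token."""
    def __init__(self, dim):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, dim)
        self.head = nn.Linear(dim, VOCAB)

    @torch.no_grad()
    def next_token(self, prefix):
        h = self.embed(prefix).mean(dim=0)
        return self.head(h).argmax().item()

draft, target = TinyLM(16), TinyLM(64)   # little and big models

def speculative_decode(prompt, steps=4, draft_len=3):
    tokens = list(prompt)
    for _ in range(steps):
        # 1) The cheap draft model proposes a short block of tokens.
        proposed = []
        for _ in range(draft_len):
            proposed.append(draft.next_token(torch.tensor(tokens + proposed)))
        # 2) The big model verifies the block: keep tokens until the first
        #    mismatch, then emit the big model's own token and drop the rest.
        for tok in proposed:
            verified = target.next_token(torch.tensor(tokens))
            if tok == verified:
                tokens.append(tok)
            else:
                tokens.append(verified)
                break
    return tokens

print(speculative_decode([1, 2, 3]))
```

The speed-up comes from the fact that, in practice, the large model can score a whole drafted block in one batched forward pass, so accepted draft tokens cost far less than generating them one by one with the large model.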

Enable deep learning on mobile devices: Methods, systems, and applications

H Cai, J Lin, Y Lin, Z Liu, H Tang, H Wang… - ACM Transactions on …, 2022 - dl.acm.org
Deep neural networks (DNNs) have achieved unprecedented success in the field of artificial
intelligence (AI), including computer vision, natural language processing, and speech …

Funnel-transformer: Filtering out sequential redundancy for efficient language processing

Z Dai, G Lai, Y Yang, Q Le - Advances in neural information …, 2020 - proceedings.neurips.cc
With the success of language pretraining, it is highly desirable to develop more efficient
architectures of good scalability that can exploit the abundant unlabeled data at a lower cost …
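The title points at compressing the token sequence as it moves through the encoder; the following is a hedged sketch of that general funnel idea (pooling hidden states between blocks so deeper layers see shorter sequences), not the exact Funnel-Transformer architecture.

```python
# Sketch of a "funnel" encoder: hidden states are mean-pooled between blocks,
# so each deeper block attends over a progressively shorter sequence.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FunnelEncoder(nn.Module):
    def __init__(self, dim=64, heads=4, blocks=3):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True)
            for _ in range(blocks)
        )

    def forward(self, x):
        for i, block in enumerate(self.blocks):
            x = block(x)
            if i < len(self.blocks) - 1:
                # Stride-2 mean pooling halves the sequence length between blocks.
                x = F.avg_pool1d(x.transpose(1, 2), kernel_size=2, stride=2).transpose(1, 2)
        return x

enc = FunnelEncoder()
out = enc(torch.randn(2, 16, 64))
print(out.shape)   # sequence shrinks from 16 to 4 tokens across the blocks
```

Shrinking the sequence cuts the quadratic attention cost in the deeper layers, which is the efficiency argument the abstract gestures at.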

Compressing large-scale transformer-based models: A case study on BERT

P Ganesh, Y Chen, X Lou, MA Khan, Y Yang… - Transactions of the …, 2021 - direct.mit.edu
Pre-trained Transformer-based models have achieved state-of-the-art performance for
various Natural Language Processing (NLP) tasks. However, these models often have …

Neural architecture search for transformers: A survey

KT Chitty-Venkata, M Emani, V Vishwanath… - IEEE …, 2022 - ieeexplore.ieee.org
Transformer-based Deep Neural Network architectures have gained tremendous interest
due to their effectiveness in various applications across Natural Language Processing (NLP) …

FashionBERT: Text and image matching with adaptive loss for cross-modal retrieval

D Gao, L Jin, B Chen, M Qiu, P Li, Y Wei, Y Hu… - Proceedings of the 43rd …, 2020 - dl.acm.org
In this paper, we address text and image matching in cross-modal retrieval for the fashion
industry. Unlike matching in the general domain, fashion matching is required …