A survey of techniques for optimizing transformer inference

KT Chitty-Venkata, S Mittal, M Emani… - Journal of Systems …, 2023 - Elsevier
Recent years have seen a phenomenal rise in the performance and applications of
transformer neural networks. The family of transformer networks, including Bidirectional …

GPTQ: Accurate post-training quantization for generative pre-trained transformers

E Frantar, S Ashkboos, T Hoefler, D Alistarh - arXiv preprint arXiv …, 2022 - arxiv.org
Generative Pre-trained Transformer models, known as GPT or OPT, set themselves apart
through breakthrough performance across complex language modelling tasks, but also by …
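
To make the setting concrete, here is a rough numpy sketch of the layer-wise error-compensation idea behind GPTQ: columns are quantized one at a time and the rounding error is spread over the not-yet-quantized columns using a Hessian built from calibration activations. This is an illustrative simplification (the paper processes columns in blocks and uses a Cholesky factorization of the inverse Hessian); the 4-bit symmetric grid, the damping constant, and all variable names are assumptions for the example.

```python
import numpy as np

def rtn(w, scale):
    """Round-to-nearest onto a symmetric 4-bit grid, then dequantize."""
    return np.clip(np.round(w / scale), -8, 7) * scale

def gptq_like_quantize(W, X, damp=0.01):
    """Illustrative layer-wise quantization with error compensation (not the paper's
    optimized algorithm). W: (out_features, in_features) weights;
    X: (in_features, n_samples) calibration activations."""
    H = 2.0 * X @ X.T                                     # proxy Hessian of the layer-wise loss
    H += damp * np.mean(np.diag(H)) * np.eye(H.shape[0])  # damping for numerical stability
    Hinv = np.linalg.inv(H)

    W = W.astype(np.float64).copy()
    Q = np.zeros_like(W)
    scale = np.abs(W).max(axis=1, keepdims=True) / 7.0    # per-row symmetric scale

    for j in range(W.shape[1]):                           # quantize one input column at a time
        q = rtn(W[:, j], scale[:, 0])
        Q[:, j] = q
        err = (W[:, j] - q) / Hinv[j, j]
        # spread the rounding error over the remaining, not-yet-quantized columns
        W[:, j + 1:] -= np.outer(err, Hinv[j, j + 1:])
    return Q, scale
```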

Efficient large language models: A survey

Z Wan, X Wang, C Liu, S Alam, Y Zheng, J Liu… - arXiv preprint arXiv …, 2023 - arxiv.org
Large Language Models (LLMs) have demonstrated remarkable capabilities in important
tasks such as natural language understanding and language generation, and thus have the …

OPTQ: Accurate quantization for generative pre-trained transformers

E Frantar, S Ashkboos, T Hoefler… - … Conference on Learning …, 2022 - openreview.net
Generative Pre-trained Transformer models, known as GPT or OPT, set themselves apart
through breakthrough performance across complex language modelling tasks, but also by …

Efficient methods for natural language processing: A survey

M Treviso, JU Lee, T Ji, B van Aken, Q Cao… - Transactions of the …, 2023 - direct.mit.edu
Recent work in natural language processing (NLP) has yielded appealing results from
scaling model parameters and training data; however, using only scale to improve …

Speculative decoding with big little decoder

S Kim, K Mangalam, S Moon, J Malik… - Advances in …, 2024 - proceedings.neurips.cc
The recent emergence of Large Language Models based on the Transformer architecture
has enabled dramatic advancements in the field of Natural Language Processing. However …
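
The big little decoder belongs to the draft-and-verify family: a small model proposes a short continuation cheaply and the large model checks it, keeping the agreed prefix. Below is a minimal greedy sketch of that loop; `draft_next` and `target_next` are hypothetical next-token callables, the paper's actual fallback and rollback policies (based on the small model's confidence) are not reproduced, and a real implementation would verify all drafted positions in a single large-model forward pass rather than one token at a time.

```python
from typing import Callable, List

def draft_and_verify(draft_next: Callable[[List[int]], int],
                     target_next: Callable[[List[int]], int],
                     prompt: List[int],
                     draft_len: int = 4,
                     max_new_tokens: int = 64) -> List[int]:
    """Greedy draft-and-verify decoding sketch (hypothetical model callables)."""
    tokens = list(prompt)
    new = 0
    while new < max_new_tokens:
        # 1. the small model drafts a short continuation
        ctx = list(tokens)
        draft = []
        for _ in range(draft_len):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)
        # 2. the large model verifies; keep tokens until the first disagreement,
        #    then fall back to the large model's own token
        for t in draft:
            verified = target_next(tokens)   # one token at a time here, for clarity only
            tokens.append(verified)
            new += 1
            if verified != t or new >= max_new_tokens:
                break
    return tokens
```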

ZeroQuant-V2: Exploring post-training quantization in LLMs from comprehensive study to low rank compensation

Z Yao, X Wu, C Li, S Youn, Y He - arXiv preprint arXiv:2303.08302, 2023 - arxiv.org
Post-training quantization (PTQ) has emerged as a promising technique for mitigating
memory consumption and computational costs in large language models (LLMs). However …
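
The low-rank compensation (LoRC) idea named in the title can be sketched in a few lines: factorize the quantization error with a truncated SVD and add the low-rank term back on top of the quantized weights at inference time. The rank, shapes, and names below are assumptions for illustration, not the paper's exact recipe.

```python
import numpy as np

def low_rank_compensation(W: np.ndarray, W_q: np.ndarray, rank: int = 8):
    """Approximate the quantization error W - W_q with a rank-`rank` factorization
    (a sketch of the low-rank compensation idea)."""
    E = W - W_q                                   # quantization error
    U, S, Vt = np.linalg.svd(E, full_matrices=False)
    U_k = U[:, :rank] * S[:rank]                  # fold singular values into U
    V_k = Vt[:rank, :]
    return U_k, V_k                               # use W_q + U_k @ V_k at inference
```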

Understanding INT4 quantization for language models: latency speedup, composability, and failure cases

X Wu, C Li, RY Aminabadi, Z Yao… - … Conference on Machine …, 2023 - proceedings.mlr.press
Improving the deployment efficiency of transformer-based language models has been
challenging given their high computation and memory cost. While INT8 quantization has …
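
As a concrete illustration of why INT4 weights help memory and latency, the sketch below performs symmetric group-wise 4-bit quantization and packs two values per byte, roughly 4x smaller than FP16 storage. The group size, packing layout, and function names are assumptions; production kernels fuse the unpacking into the matmul.

```python
import numpy as np

def quantize_int4(w: np.ndarray, group_size: int = 128):
    """Symmetric group-wise INT4 quantization of a 1-D weight tensor (a sketch).
    Assumes w.size is a multiple of group_size."""
    g = w.reshape(-1, group_size)
    scale = np.abs(g).max(axis=1, keepdims=True) / 7.0
    scale = np.where(scale == 0, 1.0, scale)      # guard all-zero groups
    q = np.clip(np.round(g / scale), -8, 7).astype(np.int8)
    # pack two 4-bit values into each byte
    u = (q + 8).astype(np.uint8).reshape(-1, 2)
    packed = (u[:, 0] << 4) | u[:, 1]
    return packed, scale

def dequantize_int4(packed: np.ndarray, scale: np.ndarray, group_size: int = 128):
    hi = (packed >> 4).astype(np.int8) - 8
    lo = (packed & 0x0F).astype(np.int8) - 8
    q = np.stack([hi, lo], axis=1).reshape(-1, group_size)
    return (q * scale).reshape(-1)
```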

Exploring post-training quantization in LLMs from comprehensive study to low rank compensation

Z Yao, X Wu, C Li, S Youn, Y He - … of the AAAI Conference on Artificial …, 2024 - ojs.aaai.org
Post-training quantization (PTQ) has emerged as a promising technique for mitigating
memory consumption and computational costs in large language models (LLMs). However …

ZeroQuant-FP: A leap forward in LLMs post-training W4A8 quantization using floating-point formats

X Wu, Z Yao, Y He - arXiv preprint arXiv:2307.09782, 2023 - arxiv.org
In the complex domain of large language models (LLMs), striking a balance between
computational efficiency and maintaining model quality is a formidable challenge …
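
W4A8 with floating-point formats means 4-bit floating-point (FP4, E2M1) weights combined with 8-bit floating-point activations. The sketch below only simulates the FP4 weight side by snapping scaled weights onto the E2M1 grid; FP8 activation handling and the scale granularity are left out, and all names are assumptions for illustration.

```python
import numpy as np

# Non-negative values representable by a 4-bit E2M1 float (FP4)
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quantize_fp4(w: np.ndarray) -> np.ndarray:
    """Simulate FP4 (E2M1) weight quantization with a per-row scale (a sketch)."""
    scale = np.abs(w).max(axis=-1, keepdims=True) / FP4_GRID[-1]
    scale = np.where(scale == 0, 1.0, scale)         # guard all-zero rows
    x = w / scale
    # snap each magnitude to the nearest representable FP4 value
    idx = np.argmin(np.abs(np.abs(x)[..., None] - FP4_GRID), axis=-1)
    return np.sign(x) * FP4_GRID[idx] * scale
```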