Lightweight deep learning for resource-constrained environments: A survey

HI Liu, M Galindo, H Xie, LK Wong, HH Shuai… - ACM Computing …, 2024 - dl.acm.org
Over the past decade, the dominance of deep learning has prevailed across various
domains of artificial intelligence, including natural language processing, computer vision …

Pre-trained models for natural language processing: A survey

X Qiu, T Sun, Y Xu, Y Shao, N Dai, X Huang - Science China …, 2020 - Springer
Recently, the emergence of pre-trained models (PTMs) has brought natural language
processing (NLP) to a new era. In this survey, we provide a comprehensive review of PTMs …

GPT3.int8(): 8-bit matrix multiplication for transformers at scale

T Dettmers, M Lewis, Y Belkada… - Advances in neural …, 2022 - proceedings.neurips.cc
Large language models have been widely adopted but require significant GPU memory for
inference. We develop a procedure for Int8 matrix multiplication for feed-forward and …
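
The core recipe, sketched minimally in NumPy below (the paper's real kernels are fused GPU ops; `int8_matmul_with_outliers` and its default threshold are illustrative stand-ins): keep the few outlier feature dimensions in floating point and compute the rest of the product in int8 with per-row/per-column absmax scales.

```python
import numpy as np

def absmax_quantize(x, axis):
    """Symmetric int8 quantization with per-axis absmax scaling."""
    scale = 127.0 / np.maximum(np.abs(x).max(axis=axis, keepdims=True), 1e-8)
    q = np.clip(np.round(x * scale), -127, 127).astype(np.int8)
    return q, scale

def int8_matmul_with_outliers(X, W, threshold=6.0):
    """Mixed-precision matmul: int8 for most feature dimensions, floating
    point for the outlier columns of X whose magnitude exceeds `threshold`."""
    outlier = np.abs(X).max(axis=0) > threshold
    regular = ~outlier

    # Outlier features stay in high precision (a tiny slice of the matmul).
    out_hi = X[:, outlier] @ W[outlier, :]

    # Regular features: per-row scales for X, per-column scales for W.
    Xq, sx = absmax_quantize(X[:, regular], axis=1)
    Wq, sw = absmax_quantize(W[regular, :], axis=0)

    # Accumulate the integer product in int32, then dequantize.
    acc = Xq.astype(np.int32) @ Wq.astype(np.int32)
    return out_hi + acc / (sx * sw)
```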

QuIP: 2-bit quantization of large language models with guarantees

J Chee, Y Cai, V Kuleshov… - Advances in Neural …, 2023 - proceedings.neurips.cc
This work studies post-training parameter quantization in large language models (LLMs).
We introduce quantization with incoherence processing (QuIP), a new method based on the …
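
A simplified sketch of the incoherence idea, assuming NumPy and plain nearest rounding onto a 2-bit grid (QuIP's actual rounding is the adaptive LDLQ procedure, which this omits): conjugating W by random orthogonal matrices spreads large entries out before quantization, and the rotation is undone afterwards.

```python
import numpy as np

def random_orthogonal(n, rng):
    """Haar-random orthogonal matrix via QR of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))  # sign fix for uniformity

def quantize_2bit(w):
    """Nearest rounding onto the 4-level grid {-1.5, -0.5, 0.5, 1.5} * scale."""
    scale = max(np.abs(w).max() / 1.5, 1e-8)
    q = np.clip(np.round(w / scale - 0.5), -2, 1)
    return (q + 0.5) * scale

def quip_sketch(W, rng):
    """Incoherence processing: rotate, quantize, rotate back."""
    U = random_orthogonal(W.shape[0], rng)
    V = random_orthogonal(W.shape[1], rng)
    W_inc = U @ W @ V.T            # incoherent representation of W
    W_hat = quantize_2bit(W_inc)   # QuIP uses LDLQ adaptive rounding here
    return U.T @ W_hat @ V         # approximation of the original W
```

On a weight matrix with a few large outliers, the rotated matrix quantizes with noticeably lower reconstruction error than quantizing W directly, which is the point of the processing step.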

Q-Diffusion: Quantizing diffusion models

X Li, Y Liu, L Lian, H Yang, Z Dong… - Proceedings of the …, 2023 - openaccess.thecvf.com
Diffusion models have achieved great success in image synthesis through iterative noise
estimation using deep neural networks. However, the slow inference, high memory …
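
The difficulty is structural: the noise estimator is invoked at every one of the T denoising steps, so any quantization error in it is fed back into the sample repeatedly. A minimal DDPM-style loop makes this visible; `eps_model` (the quantized network) and `alphas_cumprod` are placeholders, and the simple sigma_t^2 = beta_t noise variance is used rather than anything specific to the paper.

```python
import numpy as np

def ddpm_sample(eps_model, alphas_cumprod, shape, rng):
    """Iterative denoising: the quantized eps_model runs once per step,
    so its error compounds over the whole trajectory."""
    T = len(alphas_cumprod)
    x = rng.standard_normal(shape)          # start from pure noise
    for t in reversed(range(T)):
        a_bar = alphas_cumprod[t]
        a_bar_prev = alphas_cumprod[t - 1] if t > 0 else 1.0
        alpha_t = a_bar / a_bar_prev        # per-step alpha
        eps = eps_model(x, t)               # quantized noise estimate
        # Posterior mean of x_{t-1} given the predicted noise.
        x = (x - (1.0 - alpha_t) / np.sqrt(1.0 - a_bar) * eps) / np.sqrt(alpha_t)
        if t > 0:                           # add fresh noise except at t = 0
            x = x + np.sqrt(1.0 - alpha_t) * rng.standard_normal(shape)
    return x
```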

ZeroQuant: Efficient and affordable post-training quantization for large-scale transformers

Z Yao, R Yazdani Aminabadi… - Advances in …, 2022 - proceedings.neurips.cc
How to efficiently serve ever-larger trained natural language models in practice has become
exceptionally challenging even for powerful cloud servers due to their prohibitive …

SqueezeLLM: Dense-and-sparse quantization

S Kim, C Hooper, A Gholami, Z Dong, X Li… - arXiv preprint arXiv …, 2023 - arxiv.org
Generative Large Language Models (LLMs) have demonstrated remarkable results for a
wide range of tasks. However, deploying these models for inference has been a significant …
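
The title names the mechanism: split W into a small sparse matrix that keeps outlier weights in full precision plus a dense low-bit remainder. A hedged NumPy sketch (SqueezeLLM's dense part actually uses sensitivity-weighted non-uniform, k-means-style quantization; uniform rounding and the 0.5% outlier fraction here are stand-ins):

```python
import numpy as np

def dense_and_sparse(W, outlier_frac=0.005, bits=3):
    """Decompose W ~= W_dense_hat + W_sparse, quantizing only the dense part."""
    # Keep the largest-magnitude fraction of weights in full precision.
    cutoff = np.quantile(np.abs(W), 1.0 - outlier_frac)
    mask = np.abs(W) >= cutoff
    W_sparse = np.where(mask, W, 0.0)   # stored in a sparse format in practice

    # Uniformly quantize the dense remainder with an absmax scale.
    W_dense = np.where(mask, 0.0, W)
    qmax = 2 ** (bits - 1) - 1
    scale = max(np.abs(W_dense).max() / qmax, 1e-8)
    W_dense_hat = np.clip(np.round(W_dense / scale), -qmax - 1, qmax) * scale

    return W_dense_hat, W_sparse
```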

A white paper on neural network quantization

M Nagel, M Fournarakis, RA Amjad… - arXiv preprint arXiv …, 2021 - arxiv.org
While neural networks have advanced the frontiers in many applications, they often come at
a high computational cost. Reducing the power and latency of neural network inference is …
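
The workhorse these methods build on is uniform affine quantization: q = clamp(round(x/s) + z, 0, 2^b - 1) with dequantization x_hat = s * (q - z). A minimal NumPy rendering with per-tensor min/max calibration (the white paper also covers per-channel scales, symmetric grids, and quantization-aware training):

```python
import numpy as np

def affine_quantize(x, bits=8):
    """Map the observed range [x.min(), x.max()] onto the unsigned int grid."""
    qmin, qmax = 0, 2 ** bits - 1
    scale = max(float(x.max() - x.min()) / (qmax - qmin), 1e-8)
    zero_point = int(round(qmin - x.min() / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def affine_dequantize(q, scale, zero_point):
    """Recover a floating-point approximation x_hat = scale * (q - z)."""
    return scale * (q.astype(np.float32) - zero_point)
```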

A survey of quantization methods for efficient neural network inference

A Gholami, S Kim, Z Dong, Z Yao… - Low-power computer …, 2022 - taylorfrancis.com
This chapter provides approaches to the problem of quantizing the numerical values in deep
neural network computations, covering the advantages/disadvantages of current methods …

DeepSpeed-MoE: Advancing mixture-of-experts inference and training to power next-generation AI scale

S Rajbhandari, C Li, Z Yao, M Zhang… - International …, 2022 - proceedings.mlr.press
As the training of giant dense models hits the boundary on the availability and capability of
the hardware resources today, Mixture-of-Experts (MoE) models have become one of the …
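
At its core, an MoE layer routes each token to a few expert FFNs chosen by a learned gate, which is how parameter count grows without growing per-token compute. A toy top-k gate in NumPy (a generic sketch, not DeepSpeed-MoE's implementation; `experts` and `gate_W` are placeholders):

```python
import numpy as np

def moe_layer(x, experts, gate_W, k=1):
    """x: (tokens, d); experts: list of callables mapping (d,) -> (d,);
    gate_W: (d, n_experts). Only k experts run per token."""
    logits = x @ gate_W                          # (tokens, n_experts)
    topk = np.argsort(logits, axis=-1)[:, -k:]   # chosen expert indices
    sel = np.take_along_axis(logits, topk, axis=-1)
    weights = np.exp(sel - sel.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over chosen experts

    out = np.zeros_like(x)
    for token in range(x.shape[0]):
        for j in range(k):
            e = topk[token, j]
            out[token] += weights[token, j] * experts[e](x[token])
    return out
```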