Pre-trained language models and their applications

H Wang, J Li, H Wu, E Hovy, Y Sun - Engineering, 2023 - Elsevier
Pre-trained language models have achieved striking success in natural language
processing (NLP), leading to a paradigm shift from supervised learning to pre-training …

A review of sparse expert models in deep learning

W Fedus, J Dean, B Zoph - arXiv preprint arXiv:2209.01667, 2022 - arxiv.org
Sparse expert models are a thirty-year-old concept re-emerging as a popular architecture in
deep learning. This class of architecture encompasses Mixture-of-Experts, Switch …
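
To make the surveyed architecture class concrete, here is a minimal NumPy sketch of a dense Mixture-of-Experts layer for a single token, assuming the standard formulation: a learned gate produces softmax weights over experts, and the layer output is the gate-weighted combination of the expert outputs. All names and sizes are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts = 8, 4
x = rng.normal(size=d_model)                     # one token's hidden state

# Each "expert" stands in for a feed-forward block (here: one matrix).
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]
W_gate = rng.normal(size=(d_model, n_experts))   # gating network weights

# Gating network: softmax over per-expert logits.
logits = x @ W_gate
gates = np.exp(logits - logits.max())
gates /= gates.sum()

# Dense MoE output: gate-weighted combination of all expert outputs.
y = sum(g * (x @ E) for g, E in zip(gates, experts))
```

Sparse variants such as Switch (below) save compute by evaluating only the top-scoring expert(s) instead of all of them.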

GPT3.int8(): 8-bit matrix multiplication for transformers at scale

T Dettmers, M Lewis, Y Belkada… - Advances in Neural …, 2022 - proceedings.neurips.cc
Large language models have been widely adopted but require significant GPU memory for
inference. We develop a procedure for Int8 matrix multiplication for feed-forward and …
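
The core idea can be sketched in a few lines: quantize activations and weights to int8 with per-row and per-column absmax scales, multiply in integer arithmetic, then dequantize. This illustrative NumPy sketch omits the paper's mixed-precision decomposition for outlier features; all names and sizes are assumptions.

```python
import numpy as np

def quantize_absmax(x, axis):
    # Symmetric absmax quantization to int8 along the given axis.
    scale = np.abs(x).max(axis=axis, keepdims=True) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 16)).astype(np.float32)  # activations
B = rng.normal(size=(16, 8)).astype(np.float32)  # weights

Aq, a_scale = quantize_absmax(A, axis=1)         # per-row scales
Bq, b_scale = quantize_absmax(B, axis=0)         # per-column scales

# Multiply in integer arithmetic (accumulate in int32), then dequantize.
C = (Aq.astype(np.int32) @ Bq.astype(np.int32)) * a_scale * b_scale

print(np.abs(C - A @ B).max())                   # small quantization error
```

The memory saving comes from storing the weights in 8 bits; the float scales are negligible by comparison.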

PaLM: Scaling language modeling with Pathways

A Chowdhery, S Narang, J Devlin, M Bosma… - Journal of Machine …, 2023 - jmlr.org
Large language models have been shown to achieve remarkable performance across a
variety of natural language tasks using few-shot learning, which drastically reduces the …
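
"Few-shot learning" here means in-context learning: worked examples are placed directly in the prompt and the model completes the pattern with no gradient updates. A minimal illustration (the translation-style prompt is hypothetical, in the style popularized by the GPT-3 paper):

```python
# A few-shot prompt concatenates worked examples with a new query; the
# model is expected to continue the pattern without any fine-tuning.
prompt = (
    "Translate English to French.\n"
    "sea otter => loutre de mer\n"
    "cheese => fromage\n"
    "plush giraffe => "   # the model completes from here
)
```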

Orca: A distributed serving system for Transformer-Based generative models

GI Yu, JS Jeong, GW Kim, S Kim, BG Chun - 16th USENIX Symposium …, 2022 - usenix.org
Large-scale Transformer-based models trained for generation tasks (e.g., GPT-3) have
recently attracted huge interest, emphasizing the need for system support for serving models …

On the opportunities and risks of foundation models

R Bommasani, DA Hudson, E Adeli, R Altman… - arXiv preprint arXiv …, 2021 - arxiv.org
AI is undergoing a paradigm shift with the rise of models (e.g., BERT, DALL-E, GPT-3) that are
trained on broad data at scale and are adaptable to a wide range of downstream tasks. We …

AlpaServe: Statistical multiplexing with model parallelism for deep learning serving

Z Li, L Zheng, Y Zhong, V Liu, Y Sheng, X Jin… - … USENIX Symposium on …, 2023 - usenix.org
Model parallelism is conventionally viewed as a method to scale a single large deep
learning model beyond the memory limits of a single device. In this paper, we demonstrate …
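
The statistical-multiplexing argument can be illustrated with a toy simulation: if two models each receive bursty traffic, dedicating one GPU per model leaves the other GPU idle during a burst, while sharding both models across both GPUs lets either burst draw on the aggregate capacity. This NumPy sketch ignores model-parallel communication overheads; the traffic model and capacities are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 10_000                      # time steps
# Bursty demand for two models (requests per step): occasional spikes.
load_a = rng.poisson(lam=np.where(rng.random(T) < 0.1, 8, 0.5))
load_b = rng.poisson(lam=np.where(rng.random(T) < 0.1, 8, 0.5))

cap = 5                         # per-GPU capacity (requests per step)

# Dedicated placement: model A only on GPU 0, model B only on GPU 1.
drop_dedicated = (np.maximum(load_a - cap, 0).sum()
                  + np.maximum(load_b - cap, 0).sum())

# Multiplexed placement: both models sharded across both GPUs, so
# bursts for either model share the pooled capacity 2 * cap.
drop_multiplexed = np.maximum(load_a + load_b - 2 * cap, 0).sum()

print(drop_dedicated, drop_multiplexed)  # multiplexing drops fewer requests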

Content-aware local GAN for photo-realistic super-resolution

JK Park, S Son, KM Lee - Proceedings of the IEEE/CVF …, 2023 - openaccess.thecvf.com
Recently, GANs have successfully contributed to making single-image super-resolution (SISR)
methods produce more realistic images. However, natural images have complex distribution …

Efficient large-scale language model training on GPU clusters using Megatron-LM

D Narayanan, M Shoeybi, J Casper… - Proceedings of the …, 2021 - dl.acm.org
Large language models have led to state-of-the-art accuracies across several tasks.
However, training these models efficiently is challenging because: a) GPU memory capacity …
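
One building block of such training systems is tensor (intra-layer) model parallelism, of which Megatron-LM's column-parallel linear layer is a well-known instance. A minimal NumPy sketch with two simulated devices; sizes are illustrative, and the real system replaces the concatenation with an all-gather across GPUs.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 16, 8
x = rng.normal(size=(4, d_in))   # a batch of activations
W = rng.normal(size=(d_in, d_out))

# Column-parallel split: each device holds half of W's output columns
# and computes its slice of the output independently.
W0, W1 = np.split(W, 2, axis=1)
y0 = x @ W0                      # computed on "device 0"
y1 = x @ W1                      # computed on "device 1"

# Concatenating the partial outputs (an all-gather in practice)
# reproduces the un-partitioned result exactly.
y = np.concatenate([y0, y1], axis=1)
assert np.allclose(y, x @ W)
```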

Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity

W Fedus, B Zoph, N Shazeer - Journal of Machine Learning Research, 2022 - jmlr.org
In deep learning, models typically reuse the same parameters for all inputs. Mixture of
Experts (MoE) models defy this and instead select different parameters for each incoming …
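
A hedged sketch of the routing rule the paper is known for: each token is sent to the single expert with the highest router probability (top-1 "switch" routing), and that expert's output is scaled by the gate probability so the router stays differentiable. Names and sizes are illustrative; capacity limits and the load-balancing loss are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, n_experts = 6, 8, 4
x = rng.normal(size=(n_tokens, d_model))
W_router = rng.normal(size=(d_model, n_experts))
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]

# Router: softmax over expert logits, then pick the top-1 expert per token.
logits = x @ W_router
probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)
chosen = probs.argmax(axis=-1)

# Each token is processed by exactly one expert; the output is scaled
# by that expert's gate probability.
y = np.stack([probs[i, chosen[i]] * (x[i] @ experts[chosen[i]])
              for i in range(n_tokens)])
```

Because only one expert runs per token, parameter count scales with the number of experts while per-token compute stays roughly constant.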