Pre-trained language models and their applications
Pre-trained language models have achieved striking success in natural language
processing (NLP), leading to a paradigm shift from supervised learning to pre-training …
A review of sparse expert models in deep learning
Sparse expert models are a thirty-year-old concept re-emerging as a popular architecture in
deep learning. This class of architecture encompasses Mixture-of-Experts, Switch …
GPT3.int8(): 8-bit matrix multiplication for transformers at scale
Large language models have been widely adopted but require significant GPU memory for
inference. We develop a procedure for Int8 matrix multiplication for feed-forward and …
PaLM: Scaling language modeling with pathways
Large language models have been shown to achieve remarkable performance across a
variety of natural language tasks using few-shot learning, which drastically reduces the …
Orca: A distributed serving system for Transformer-based generative models
Large-scale Transformer-based models trained for generation tasks (e.g., GPT-3) have
recently attracted huge interest, emphasizing the need for system support for serving models …
On the opportunities and risks of foundation models
AI is undergoing a paradigm shift with the rise of models (e.g., BERT, DALL-E, GPT-3) that are
trained on broad data at scale and are adaptable to a wide range of downstream tasks. We …
AlpaServe: Statistical multiplexing with model parallelism for deep learning serving
Model parallelism is conventionally viewed as a method to scale a single large deep
learning model beyond the memory limits of a single device. In this paper, we demonstrate …
Content-aware local GAN for photo-realistic super-resolution
Recently, GANs have successfully contributed to making single-image super-resolution (SISR)
methods produce more realistic images. However, natural images have complex distribution …
Efficient large-scale language model training on GPU clusters using Megatron-LM
Large language models have led to state-of-the-art accuracies across several tasks.
However, training these models efficiently is challenging because: a) GPU memory capacity …
Switch Transformers: Scaling to trillion-parameter models with simple and efficient sparsity
In deep learning, models typically reuse the same parameters for all inputs. Mixture of
Experts (MoE) models defy this and instead select different parameters for each incoming …