Cramming: Training a Language Model on a single GPU in one day.
Recent trends in language modeling have focused on increasing performance through
scaling, and have resulted in an environment where training language models is out of …
No train no gain: Revisiting efficient training algorithms for transformer-based language models
The computation necessary for training Transformer-based language models has
skyrocketed in recent years. This trend has motivated research on efficient training …
Large-scale differentially private BERT
In this work, we study the large-scale pretraining of BERT-Large with differentially private
SGD (DP-SGD). We show that combined with a careful implementation, scaling up the batch …
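Since the abstract is truncated, the following is only a minimal sketch of the core DP-SGD step it refers to (per-example gradient clipping followed by calibrated Gaussian noise on the summed gradient); the toy model, data, and hyperparameter values are illustrative assumptions, not the paper's BERT-Large configuration.

```python
# Minimal sketch of one DP-SGD step: clip each per-example gradient, sum,
# add Gaussian noise, then average and apply. Model/data/hyperparameters
# here are illustrative assumptions only.
import torch

def dp_sgd_step(model, loss_fn, xb, yb, lr=0.1, clip_norm=1.0, noise_mult=1.1):
    params = [p for p in model.parameters() if p.requires_grad]
    summed = [torch.zeros_like(p) for p in params]

    # Per-example gradients, each clipped to L2 norm <= clip_norm.
    for x, y in zip(xb, yb):
        model.zero_grad()
        loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
        grads = [p.grad.detach().clone() for p in params]
        total_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = (clip_norm / (total_norm + 1e-6)).clamp(max=1.0)
        for s, g in zip(summed, grads):
            s += g * scale

    # Add calibrated Gaussian noise to the summed gradient, average, update.
    batch_size = xb.shape[0]
    with torch.no_grad():
        for p, s in zip(params, summed):
            noisy = s + noise_mult * clip_norm * torch.randn_like(s)
            p -= lr * noisy / batch_size

# Illustrative usage on a toy linear classifier.
model = torch.nn.Linear(16, 2)
xb, yb = torch.randn(32, 16), torch.randint(0, 2, (32,))
dp_sgd_step(model, torch.nn.functional.cross_entropy, xb, yb)
```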
Does knowledge distillation really work?
Knowledge distillation is a popular technique for training a small student network to
emulate a larger teacher model, such as an ensemble of networks. We show that while …
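As context for the distillation setup the abstract describes, a minimal sketch of the standard knowledge-distillation objective (temperature-softened teacher targets mixed with hard-label cross-entropy) might look as follows; the temperature and mixing weight are illustrative assumptions, not values taken from the paper.

```python
# Minimal sketch of the standard knowledge-distillation loss: a KL term
# against temperature-softened teacher probabilities plus the usual
# hard-label cross-entropy. T and alpha are illustrative assumptions.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft-target term, scaled by T^2 so gradient magnitudes stay
    # comparable across temperatures.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard-label term: ordinary cross-entropy on ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# Illustrative usage with random logits for a 10-class problem.
s, t = torch.randn(8, 10), torch.randn(8, 10)
y = torch.randint(0, 10, (8,))
print(distillation_loss(s, t, y))
```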
Sharpness-aware minimization improves language model generalization
The allure of superhuman-level capabilities has led to considerable interest in language
models like GPT-3 and T5, wherein the research has, by and large, revolved around new …
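A minimal sketch of one sharpness-aware minimization (SAM) step (ascend to an approximate worst-case nearby weight setting, then descend using the gradient computed there) could look like this; the toy linear model and the rho and learning-rate values are illustrative assumptions, not the paper's language-model experiments.

```python
# Minimal sketch of one SAM step: first pass computes the ascent direction,
# second pass computes the gradient at the perturbed weights, which drives
# the actual optimizer update. Model/data/hyperparameters are illustrative.
import torch

def sam_step(model, loss_fn, xb, yb, base_opt, rho=0.05):
    params = [p for p in model.parameters() if p.requires_grad]

    # First pass: gradient at the current weights.
    base_opt.zero_grad()
    loss_fn(model(xb), yb).backward()
    grad_norm = torch.sqrt(sum(p.grad.pow(2).sum() for p in params))

    # Ascent: move each weight toward the approximate worst-case nearby point.
    eps = []
    with torch.no_grad():
        for p in params:
            e = rho * p.grad / (grad_norm + 1e-12)
            p += e
            eps.append(e)

    # Second pass: gradient at the perturbed weights.
    base_opt.zero_grad()
    loss_fn(model(xb), yb).backward()
    with torch.no_grad():
        for p, e in zip(params, eps):
            p -= e  # undo the ascent before the optimizer step
    base_opt.step()

# Illustrative usage on a toy linear classifier.
model = torch.nn.Linear(16, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
xb, yb = torch.randn(32, 16), torch.randint(0, 2, (32,))
sam_step(model, torch.nn.functional.cross_entropy, xb, yb, opt)
```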