No train no gain: Revisiting efficient training algorithms for transformer-based language models

J Kaddour, O Key, P Nawrot… - Advances in Neural …, 2024 - proceedings.neurips.cc
The computation necessary for training Transformer-based language models has
skyrocketed in recent years. This trend has motivated research on efficient training …

Large-scale differentially private BERT

R Anil, B Ghazi, V Gupta, R Kumar… - arXiv preprint arXiv …, 2021 - arxiv.org
In this work, we study the large-scale pretraining of BERT-Large with differentially private
SGD (DP-SGD). We show that combined with a careful implementation, scaling up the batch …
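The entry above centers on DP-SGD pretraining. As a minimal illustrative sketch (not the paper's configuration), the PyTorch snippet below shows the core DP-SGD update: clip each example's gradient to a fixed norm, sum, add Gaussian noise, and step. The tiny linear model, clip_norm, noise_multiplier, and learning rate are assumptions chosen only to make the example runnable.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(16, 2)          # stand-in for a large language model such as BERT-Large
loss_fn = nn.CrossEntropyLoss()
clip_norm = 1.0                   # per-example gradient clipping bound C (illustrative)
noise_multiplier = 0.5            # sigma; noise stddev is sigma * C (illustrative)
lr = 0.1

xs = torch.randn(8, 16)           # one small batch of 8 synthetic examples
ys = torch.randint(0, 2, (8,))

# Accumulate clipped per-example gradients.
summed = [torch.zeros_like(p) for p in model.parameters()]
for x, y in zip(xs, ys):
    model.zero_grad()
    loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
    loss.backward()
    grads = [p.grad.detach().clone() for p in model.parameters()]
    total_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
    scale = torch.clamp(clip_norm / (total_norm + 1e-6), max=1.0)
    for s, g in zip(summed, grads):
        s.add_(g * scale)

# Add Gaussian noise to the summed clipped gradients, average, and take an SGD step.
with torch.no_grad():
    for p, s in zip(model.parameters(), summed):
        noise = torch.randn_like(s) * noise_multiplier * clip_norm
        p.add_(-(lr / len(xs)) * (s + noise))
```

In practice the per-example clipping would be vectorized and the batch scaled up, which is the regime the paper studies; this sketch only spells out the mechanism.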

Does knowledge distillation really work?

S Stanton, P Izmailov, P Kirichenko… - Advances in …, 2021 - proceedings.neurips.cc
Knowledge distillation is a popular technique for training a small student network to
emulate a larger teacher model, such as an ensemble of networks. We show that while …
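For reference, a minimal sketch of the standard knowledge-distillation objective this paper interrogates: the student is trained to match temperature-softened teacher probabilities (KL term) alongside the usual cross-entropy on labels. The temperature T, mixing weight alpha, and random tensors below are illustrative assumptions, not the paper's setup.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft-target term: KL between temperature-softened teacher and student
    # distributions, scaled by T^2 as is conventional.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard-label term: ordinary cross-entropy on the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

student_logits = torch.randn(4, 10, requires_grad=True)  # small student outputs
teacher_logits = torch.randn(4, 10)                      # frozen teacher outputs
labels = torch.randint(0, 10, (4,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```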

Sharpness-aware minimization improves language model generalization

D Bahri, H Mobahi, Y Tay - arXiv preprint arXiv:2110.08529, 2021 - arxiv.org
The allure of superhuman-level capabilities has led to considerable interest in language
models like GPT-3 and T5, wherein the research has, by and large, revolved around new …
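This entry applies sharpness-aware minimization (SAM) to language models. As a minimal sketch of one SAM step under assumed hyperparameters (tiny linear model, rho, learning rate, and synthetic data are all illustrative): perturb the weights toward higher loss within an L2 ball of radius rho, compute the gradient at the perturbed point, then apply that gradient to the original weights.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(16, 2)
loss_fn = nn.CrossEntropyLoss()
base_opt = torch.optim.SGD(model.parameters(), lr=0.1)
rho = 0.05                        # radius of the perturbation ball (illustrative)

x = torch.randn(8, 16)
y = torch.randint(0, 2, (8,))

# First pass: gradient at the current weights w.
loss_fn(model(x), y).backward()
grads = [p.grad.detach().clone() for p in model.parameters()]
grad_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads)) + 1e-12

# Ascent step: move to w + rho * g / ||g||, the approximate worst case nearby.
eps = []
with torch.no_grad():
    for p, g in zip(model.parameters(), grads):
        e = rho * g / grad_norm
        p.add_(e)
        eps.append(e)

# Second pass: gradient at the perturbed weights.
base_opt.zero_grad()
loss_fn(model(x), y).backward()

# Undo the perturbation, then update w with the sharpness-aware gradient.
with torch.no_grad():
    for p, e in zip(model.parameters(), eps):
        p.sub_(e)
base_opt.step()
```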