Cramming: Training a Language Model on a single GPU in one day.
Recent trends in language modeling have focused on increasing performance through
scaling, and have resulted in an environment where training language models is out of …
No train no gain: Revisiting efficient training algorithms for transformer-based language models
The computation necessary for training Transformer-based language models has
skyrocketed in recent years. This trend has motivated research on efficient training …
Large-scale differentially private BERT
In this work, we study the large-scale pretraining of BERT-Large with differentially private
SGD (DP-SGD). We show that combined with a careful implementation, scaling up the batch …
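Since the abstract is truncated, the following is only a minimal sketch of the core DP-SGD step it refers to (per-example gradient clipping followed by calibrated Gaussian noise on the summed gradient); the toy model, data, and hyperparameter values are illustrative assumptions, not the paper's BERT-Large configuration.

```python
# Minimal sketch of one DP-SGD step: clip each per-example gradient, sum,
# add Gaussian noise, then average and apply. Model/data/hyperparameters
# here are illustrative assumptions only.
import torch

def dp_sgd_step(model, loss_fn, xb, yb, lr=0.1, clip_norm=1.0, noise_mult=1.1):
    params = [p for p in model.parameters() if p.requires_grad]
    summed = [torch.zeros_like(p) for p in params]

    # Per-example gradients, each clipped to L2 norm <= clip_norm.
    for x, y in zip(xb, yb):
        model.zero_grad()
        loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
        grads = [p.grad.detach().clone() for p in params]
        total_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = (clip_norm / (total_norm + 1e-6)).clamp(max=1.0)
        for s, g in zip(summed, grads):
            s += g * scale

    # Add calibrated Gaussian noise to the summed gradient, average, update.
    batch_size = xb.shape[0]
    with torch.no_grad():
        for p, s in zip(params, summed):
            noisy = s + noise_mult * clip_norm * torch.randn_like(s)
            p -= lr * noisy / batch_size

# Illustrative usage on a toy linear classifier.
model = torch.nn.Linear(16, 2)
xb, yb = torch.randn(32, 16), torch.randint(0, 2, (32,))
dp_sgd_step(model, torch.nn.functional.cross_entropy, xb, yb)
```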
Does knowledge distillation really work?
Knowledge distillation is a popular technique for training a small student network to
emulate a larger teacher model, such as an ensemble of networks. We show that while …
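As context for the distillation setup the abstract describes, a minimal sketch of the standard knowledge-distillation objective (temperature-softened teacher targets mixed with hard-label cross-entropy) might look as follows; the temperature and mixing weight are illustrative assumptions, not values taken from the paper.

```python
# Minimal sketch of the standard knowledge-distillation loss: a KL term
# against temperature-softened teacher probabilities plus the usual
# hard-label cross-entropy. T and alpha are illustrative assumptions.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft-target term, scaled by T^2 so gradient magnitudes stay
    # comparable across temperatures.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard-label term: ordinary cross-entropy on ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# Illustrative usage with random logits for a 10-class problem.
s, t = torch.randn(8, 10), torch.randn(8, 10)
y = torch.randint(0, 10, (8,))
print(distillation_loss(s, t, y))
```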
Sharpness-aware minimization improves language model generalization
The allure of superhuman-level capabilities has led to considerable interest in language
models like GPT-3 and T5, wherein the research has, by and large, revolved around new …
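A minimal sketch of one sharpness-aware minimization (SAM) step (ascend to an approximate worst-case nearby weight setting, then descend using the gradient computed there) could look like this; the toy linear model and the rho and learning-rate values are illustrative assumptions, not the paper's language-model experiments.

```python
# Minimal sketch of one SAM step: first pass computes the ascent direction,
# second pass computes the gradient at the perturbed weights, which drives
# the actual optimizer update. Model/data/hyperparameters are illustrative.
import torch

def sam_step(model, loss_fn, xb, yb, base_opt, rho=0.05):
    params = [p for p in model.parameters() if p.requires_grad]

    # First pass: gradient at the current weights.
    base_opt.zero_grad()
    loss_fn(model(xb), yb).backward()
    grad_norm = torch.sqrt(sum(p.grad.pow(2).sum() for p in params))

    # Ascent: move each weight toward the approximate worst-case nearby point.
    eps = []
    with torch.no_grad():
        for p in params:
            e = rho * p.grad / (grad_norm + 1e-12)
            p += e
            eps.append(e)

    # Second pass: gradient at the perturbed weights.
    base_opt.zero_grad()
    loss_fn(model(xb), yb).backward()
    with torch.no_grad():
        for p, e in zip(params, eps):
            p -= e  # undo the ascent before the optimizer step
    base_opt.step()

# Illustrative usage on a toy linear classifier.
model = torch.nn.Linear(16, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
xb, yb = torch.randn(32, 16), torch.randint(0, 2, (32,))
sam_step(model, torch.nn.functional.cross_entropy, xb, yb, opt)
```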