A disciplined approach to neural network hyper-parameters: Part 1--learning rate, batch size, momentum, and weight decay

LN Smith - arxiv preprint arxiv …, 2018 - arxiv.org

Dataset cartography: Mapping and diagnosing datasets with training dynamics

S Swayamdipta, R Schwartz, N Lourie, Y Wang… - arxiv preprint arxiv …, 2020 - arxiv.org
Large datasets have become commonplace in NLP research. However, the increased
emphasis on data quantity has made it challenging to assess the quality of data. We …

Don't use large mini-batches, use local SGD

T Lin, SU Stich, KK Patel, M Jaggi - arxiv preprint arxiv:1808.07217, 2018 - arxiv.org
Mini-batch stochastic gradient methods (SGD) are state of the art for distributed training of
deep neural networks. Drastic increases in the mini-batch sizes have led to key efficiency …
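
The title refers to local SGD, in which each worker runs several independent SGD steps before the workers synchronize by averaging their parameters. Below is a minimal sketch of that idea on a toy least-squares problem; the objective, worker count, synchronization period, and learning rate are illustrative assumptions, not the cited paper's experimental setup.

```python
import numpy as np

# Toy least-squares objective: f(w) = 0.5 * mean((X @ w - y)^2)
rng = np.random.default_rng(0)
n, d = 512, 10
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

def grad(w, idx):
    """Mini-batch gradient of the least-squares loss on rows `idx`."""
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / len(idx)

def local_sgd(num_workers=4, local_steps=8, rounds=50, lr=0.05, batch=16):
    """Each worker runs `local_steps` SGD steps from the shared iterate,
    then all workers synchronize by averaging their parameters."""
    w = np.zeros(d)                        # shared starting point
    for _ in range(rounds):
        local_models = []
        for _ in range(num_workers):
            w_local = w.copy()
            for _ in range(local_steps):
                idx = rng.choice(n, size=batch, replace=False)
                w_local -= lr * grad(w_local, idx)
            local_models.append(w_local)
        w = np.mean(local_models, axis=0)  # one communication round
    return w

w_hat = local_sgd()
print("final loss:", 0.5 * np.mean((X @ w_hat - y) ** 2))
```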

Towards explaining the regularization effect of initial large learning rate in training neural networks

Y Li, C Wei, T Ma - Advances in neural information …, 2019 - proceedings.neurips.cc
Stochastic gradient descent with a large initial learning rate is widely used for training
modern neural net architectures. Although a small initial learning rate allows for faster …

SGD with large step sizes learns sparse features

M Andriushchenko, AV Varre… - International …, 2023 - proceedings.mlr.press
We showcase important features of the dynamics of stochastic gradient descent (SGD)
in the training of neural networks. We present empirical observations that commonly used …

No train no gain: Revisiting efficient training algorithms for transformer-based language models

J Kaddour, O Key, P Nawrot… - Advances in Neural …, 2024 - proceedings.neurips.cc
The computation necessary for training Transformer-based language models has
skyrocketed in recent years. This trend has motivated research on efficient training …

Understanding the unstable convergence of gradient descent

K Ahn, J Zhang, S Sra - International Conference on …, 2022 - proceedings.mlr.press
Most existing analyses of (stochastic) gradient descent rely on the condition that for
$L$-smooth costs, the step size is less than $2/L$. However, many works have observed that in …
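
As a concrete illustration of the classical $2/L$ condition mentioned in this snippet, the sketch below runs gradient descent on a 1-D quadratic that is $L$-smooth with $L = 4$: step sizes below $2/L$ contract toward the minimum, while step sizes above it make the iterates diverge. This is a textbook demonstration of the threshold, not a reproduction of the cited paper's analysis.

```python
# Gradient descent on f(w) = 0.5 * L_smooth * w^2, whose gradient is L_smooth * w.
# The classical descent analysis requires a step size eta < 2 / L_smooth;
# beyond that threshold the update map w -> (1 - eta * L_smooth) * w expands,
# so gradient descent diverges.
L_smooth = 4.0

def run_gd(eta, steps=50, w0=1.0):
    w = w0
    for _ in range(steps):
        w -= eta * L_smooth * w   # gradient step
    return w

for eta in (0.1, 0.4, 0.49, 0.51, 0.6):   # threshold: 2 / L_smooth = 0.5
    print(f"eta={eta:4.2f} (2/L={2 / L_smooth}):  |w_50| = {abs(run_gd(eta)):.3e}")
```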

Self-stabilization: The implicit bias of gradient descent at the edge of stability

A Damian, E Nichani, JD Lee - arxiv preprint arxiv:2209.15594, 2022 - arxiv.org
Traditional analyses of gradient descent show that when the largest eigenvalue of the
Hessian, also known as the sharpness $S(\theta)$, is bounded by $2/\eta$, training is …
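
The sharpness $S(\theta)$ in this snippet is the largest Hessian eigenvalue; one standard way to estimate it without forming the Hessian is power iteration on Hessian-vector products. The sketch below does this with finite-difference Hessian-vector products on a toy one-hidden-unit regression loss and compares the estimate to $2/\eta$; the model, data, and $\eta$ are illustrative assumptions, not the cited paper's setting.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(128, 5))
y = rng.normal(size=128)

def loss_grad(theta):
    """Gradient of the toy loss 0.5 * mean((a * tanh(X @ w) - y)^2), theta = (w, a)."""
    w, a = theta[:5], theta[5]
    h = np.tanh(X @ w)
    r = a * h - y
    grad_w = X.T @ (a * r * (1 - h ** 2)) / len(y)
    grad_a = np.mean(r * h)
    return np.concatenate([grad_w, [grad_a]])

def sharpness(theta, iters=100, eps=1e-4):
    """Largest-magnitude Hessian eigenvalue via power iteration on
    finite-difference Hessian-vector products (equals the sharpness
    S(theta) when that eigenvalue is positive)."""
    v = rng.normal(size=theta.size)
    v /= np.linalg.norm(v)
    lam = 0.0
    for _ in range(iters):
        hv = (loss_grad(theta + eps * v) - loss_grad(theta - eps * v)) / (2 * eps)
        lam = float(v @ hv)                    # Rayleigh quotient estimate
        v = hv / (np.linalg.norm(hv) + 1e-12)  # renormalize for next iteration
    return lam

theta = rng.normal(size=6)
eta = 0.1
print(f"sharpness S(theta) ~ {sharpness(theta):.3f},  2/eta = {2 / eta:.3f}")
```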

A modern look at the relationship between sharpness and generalization

M Andriushchenko, F Croce, M Müller, M Hein… - arxiv preprint arxiv …, 2023 - arxiv.org
Sharpness of minima is a promising quantity that can correlate with generalization in deep
networks and, when optimized during training, can improve generalization. However …

Learning threshold neurons via edge of stability

K Ahn, S Bubeck, S Chewi, YT Lee… - Advances in Neural …, 2023 - proceedings.neurips.cc
Existing analyses of neural network training often operate under the unrealistic assumption
of an extremely small learning rate. This lies in stark contrast to practical wisdom and …