A disciplined approach to neural network hyper-parameters: Part 1--learning rate, batch size, momentum, and weight decay
LN Smith - arXiv preprint, 2018
Dataset cartography: Mapping and diagnosing datasets with training dynamics
Large datasets have become commonplace in NLP research. However, the increased emphasis on data quantity has made it challenging to assess the quality of data. We …
Don't use large mini-batches, use local SGD
Mini-batch stochastic gradient methods (SGD) are state of the art for distributed training of deep neural networks. Drastic increases in the mini-batch sizes have led to key efficiency …
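The "local SGD" in the title above refers to letting each worker take several independent gradient steps on its own copy of the parameters and only periodically averaging the copies, instead of synchronizing gradients every step. Below is a minimal NumPy sketch of that update pattern; the names (local_sgd, grad_fn, num_workers, local_steps) and the toy objective are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def local_sgd(w0, grad_fn, num_workers=4, rounds=10, local_steps=8, lr=0.1):
    """Illustrative local SGD: each worker runs `local_steps` gradient updates
    on its own parameter copy, then all copies are averaged (one round)."""
    workers = [w0.copy() for _ in range(num_workers)]
    for _ in range(rounds):
        for k in range(num_workers):
            for _ in range(local_steps):
                workers[k] = workers[k] - lr * grad_fn(workers[k], k)
        w_avg = np.mean(workers, axis=0)  # periodic model averaging
        workers = [w_avg.copy() for _ in range(num_workers)]
    return workers[0]

# Toy usage: each worker sees a slightly shifted quadratic objective.
if __name__ == "__main__":
    grad = lambda w, k: 2.0 * (w - 0.1 * k)
    print(local_sgd(np.array([5.0]), grad))
```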
Towards explaining the regularization effect of initial large learning rate in training neural networks
Stochastic gradient descent with a large initial learning rate is widely used for training modern neural net architectures. Although a small initial learning rate allows for faster …
SGD with large step sizes learns sparse features
We showcase important features of the dynamics of stochastic gradient descent (SGD) in the training of neural networks. We present empirical observations that commonly used …
No train no gain: Revisiting efficient training algorithms for transformer-based language models
The computation necessary for training Transformer-based language models has skyrocketed in recent years. This trend has motivated research on efficient training …
Understanding the unstable convergence of gradient descent
Most existing analyses of (stochastic) gradient descent rely on the condition that for $L$-smooth costs, the step size is less than $2/L$. However, many works have observed that in …
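For context, the $2/L$ threshold mentioned above is the classical descent condition; the following one-line derivation (a standard textbook bound, not part of the cited abstract) shows where it comes from. For an $L$-smooth cost $f$ and a gradient step with step size $\eta$,
\[
f\big(\theta - \eta \nabla f(\theta)\big) \;\le\; f(\theta) - \eta \|\nabla f(\theta)\|^2 + \frac{L\eta^2}{2}\,\|\nabla f(\theta)\|^2
\;=\; f(\theta) - \eta\Big(1 - \frac{L\eta}{2}\Big)\|\nabla f(\theta)\|^2 ,
\]
so the guaranteed decrease factor $\eta\,(1 - L\eta/2)$ is positive exactly when $0 < \eta < 2/L$; beyond that step size the classical analysis no longer certifies that the loss decreases.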
Self-stabilization: The implicit bias of gradient descent at the edge of stability
Traditional analyses of gradient descent show that when the largest eigenvalue of the Hessian, also known as the sharpness $S(\theta)$, is bounded by $2/\eta$, training is …
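The sharpness $S(\theta)$ in this line of work is the top Hessian eigenvalue, and "edge of stability" refers to training hovering near $S(\theta) \approx 2/\eta$. A minimal sketch of estimating that quantity by power iteration on Hessian-vector products is given below; the toy quadratic objective and helper names are illustrative assumptions, not code from the paper.

```python
import numpy as np

def sharpness_power_iteration(hvp, dim, iters=100, seed=0):
    """Estimate the top Hessian eigenvalue (the 'sharpness') via power
    iteration, using only Hessian-vector products hvp(v)."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(dim)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        hv = hvp(v)
        v = hv / np.linalg.norm(hv)
    return v @ hvp(v)  # Rayleigh quotient at convergence

if __name__ == "__main__":
    # Toy quadratic loss f(theta) = 0.5 * theta^T H theta, whose Hessian is H.
    H = np.diag([4.0, 1.0, 0.1])
    hvp = lambda v: H @ v

    eta = 0.6
    S = sharpness_power_iteration(hvp, dim=3)
    # Gradient descent on this quadratic is stable only while S < 2/eta.
    print(f"sharpness ~ {S:.3f}, stability threshold 2/eta = {2/eta:.3f}")
```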
A modern look at the relationship between sharpness and generalization
Sharpness of minima is a promising quantity that can correlate with generalization in deep networks and, when optimized during training, can improve generalization. However …
Learning threshold neurons via edge of stability
Existing analyses of neural network training often operate under the unrealistic assumption of an extremely small learning rate. This lies in stark contrast to practical wisdom and …