A disciplined approach to neural network hyper-parameters: Part 1--learning rate, batch size, momentum, and weight decay

LN Smith - arxiv preprint arxiv …, 2018 - arxiv.org

Dataset cartography: Mapping and diagnosing datasets with training dynamics

S Swayamdipta, R Schwartz, N Lourie, Y Wang… - arxiv preprint arxiv …, 2020 - arxiv.org
Large datasets have become commonplace in NLP research. However, the increased
emphasis on data quantity has made it challenging to assess the quality of data. We …

Don't use large mini-batches, use local SGD

T Lin, SU Stich, KK Patel, M Jaggi - arxiv preprint arxiv:1808.07217, 2018 - arxiv.org
Mini-batch stochastic gradient methods (SGD) are state of the art for distributed training of
deep neural networks. Drastic increases in the mini-batch sizes have led to key efficiency …
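
The title refers to local SGD, in which each worker runs several independent SGD steps before the workers synchronize by averaging their parameters. Below is a minimal sketch of that idea on a toy least-squares problem; the objective, worker count, synchronization period, and learning rate are illustrative assumptions, not the cited paper's experimental setup.

```python
import numpy as np

# Toy least-squares objective: f(w) = 0.5 * mean((X @ w - y)^2)
rng = np.random.default_rng(0)
n, d = 512, 10
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

def grad(w, idx):
    """Mini-batch gradient of the least-squares loss on rows `idx`."""
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / len(idx)

def local_sgd(num_workers=4, local_steps=8, rounds=50, lr=0.05, batch=16):
    """Each worker runs `local_steps` SGD steps from the shared iterate,
    then all workers synchronize by averaging their parameters."""
    w = np.zeros(d)                        # shared starting point
    for _ in range(rounds):
        local_models = []
        for _ in range(num_workers):
            w_local = w.copy()
            for _ in range(local_steps):
                idx = rng.choice(n, size=batch, replace=False)
                w_local -= lr * grad(w_local, idx)
            local_models.append(w_local)
        w = np.mean(local_models, axis=0)  # one communication round
    return w

w_hat = local_sgd()
print("final loss:", 0.5 * np.mean((X @ w_hat - y) ** 2))
```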

Towards explaining the regularization effect of initial large learning rate in training neural networks

Y Li, C Wei, T Ma - Advances in neural information …, 2019 - proceedings.neurips.cc
Stochastic gradient descent with a large initial learning rate is widely used for training
modern neural net architectures. Although a small initial learning rate allows for faster …

SGD with large step sizes learns sparse features

M Andriushchenko, AV Varre… - International …, 2023 - proceedings.mlr.press
We showcase important features of the dynamics of stochastic gradient descent (SGD)
in the training of neural networks. We present empirical observations that commonly used …

No train no gain: Revisiting efficient training algorithms for transformer-based language models

J Kaddour, O Key, P Nawrot… - Advances in Neural …, 2024 - proceedings.neurips.cc
The computation necessary for training Transformer-based language models has
skyrocketed in recent years. This trend has motivated research on efficient training …

Understanding the unstable convergence of gradient descent

K Ahn, J Zhang, S Sra - International Conference on …, 2022 - proceedings.mlr.press
Most existing analyses of (stochastic) gradient descent rely on the condition that for
$L$-smooth costs, the step size is less than $2/L$. However, many works have observed that in …
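
As a concrete illustration of the classical $2/L$ condition mentioned in this snippet, the sketch below runs gradient descent on a 1-D quadratic that is $L$-smooth with $L = 4$: step sizes below $2/L$ contract toward the minimum, while step sizes above it make the iterates diverge. This is a textbook demonstration of the threshold, not a reproduction of the cited paper's analysis.

```python
# Gradient descent on f(w) = 0.5 * L_smooth * w^2, whose gradient is L_smooth * w.
# The classical descent analysis requires a step size eta < 2 / L_smooth;
# beyond that threshold the update map w -> (1 - eta * L_smooth) * w expands,
# so gradient descent diverges.
L_smooth = 4.0

def run_gd(eta, steps=50, w0=1.0):
    w = w0
    for _ in range(steps):
        w -= eta * L_smooth * w   # gradient step
    return w

for eta in (0.1, 0.4, 0.49, 0.51, 0.6):   # threshold: 2 / L_smooth = 0.5
    print(f"eta={eta:4.2f} (2/L={2 / L_smooth}):  |w_50| = {abs(run_gd(eta)):.3e}")
```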

Self-stabilization: The implicit bias of gradient descent at the edge of stability

A Damian, E Nichani, JD Lee - arxiv preprint arxiv:2209.15594, 2022 - arxiv.org
Traditional analyses of gradient descent show that when the largest eigenvalue of the
Hessian, also known as the sharpness $S(\theta)$, is bounded by $2/\eta$, training is …
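
The sharpness $S(\theta)$ in this snippet is the largest Hessian eigenvalue; one standard way to estimate it without forming the Hessian is power iteration on Hessian-vector products. The sketch below does this with finite-difference Hessian-vector products on a toy one-hidden-unit regression loss and compares the estimate to $2/\eta$; the model, data, and $\eta$ are illustrative assumptions, not the cited paper's setting.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(128, 5))
y = rng.normal(size=128)

def loss_grad(theta):
    """Gradient of the toy loss 0.5 * mean((a * tanh(X @ w) - y)^2), theta = (w, a)."""
    w, a = theta[:5], theta[5]
    h = np.tanh(X @ w)
    r = a * h - y
    grad_w = X.T @ (a * r * (1 - h ** 2)) / len(y)
    grad_a = np.mean(r * h)
    return np.concatenate([grad_w, [grad_a]])

def sharpness(theta, iters=100, eps=1e-4):
    """Largest-magnitude Hessian eigenvalue via power iteration on
    finite-difference Hessian-vector products (equals the sharpness
    S(theta) when that eigenvalue is positive)."""
    v = rng.normal(size=theta.size)
    v /= np.linalg.norm(v)
    lam = 0.0
    for _ in range(iters):
        hv = (loss_grad(theta + eps * v) - loss_grad(theta - eps * v)) / (2 * eps)
        lam = float(v @ hv)                    # Rayleigh quotient estimate
        v = hv / (np.linalg.norm(hv) + 1e-12)  # renormalize for next iteration
    return lam

theta = rng.normal(size=6)
eta = 0.1
print(f"sharpness S(theta) ~ {sharpness(theta):.3f},  2/eta = {2 / eta:.3f}")
```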

A modern look at the relationship between sharpness and generalization

M Andriushchenko, F Croce, M Müller, M Hein… - arxiv preprint arxiv …, 2023 - arxiv.org
Sharpness of minima is a promising quantity that can correlate with generalization in deep
networks and, when optimized during training, can improve generalization. However …

Learning threshold neurons via edge of stability

K Ahn, S Bubeck, S Chewi, YT Lee… - Advances in Neural …, 2023 - proceedings.neurips.cc
Existing analyses of neural network training often operate under the unrealistic assumption
of an extremely small learning rate. This lies in stark contrast to practical wisdom and …