High-dimensional limit theorems for SGD: Effective dynamics and critical scaling

G Ben Arous, R Gheissari… - Advances in Neural …, 2022 - proceedings.neurips.cc
We study the scaling limits of stochastic gradient descent (SGD) with constant step-size in
the high-dimensional regime. We prove limit theorems for the trajectories of summary …

(S)GD over Diagonal Linear Networks: Implicit Bias, Large Stepsizes and Edge of Stability

M Even, S Pesme, S Gunasekar… - Advances in Neural …, 2023 - proceedings.neurips.cc
In this paper, we investigate the impact of stochasticity and large stepsizes on the implicit
regularisation of gradient descent (GD) and stochastic gradient descent (SGD) over $2 …

A modern look at the relationship between sharpness and generalization

M Andriushchenko, F Croce, M Müller, M Hein… - arXiv preprint arXiv …, 2023 - arxiv.org
Sharpness of minima is a promising quantity that can correlate with generalization in deep
networks and, when optimized during training, can improve generalization. However …

Learning threshold neurons via edge of stability

K Ahn, S Bubeck, S Chewi, YT Lee… - Advances in Neural …, 2023 - proceedings.neurips.cc
Existing analyses of neural network training often operate under the unrealistic assumption
of an extremely small learning rate. This lies in stark contrast to practical wisdom and …

Dynamics of finite width kernel and prediction fluctuations in mean field neural networks

B Bordelon, C Pehlevan - Advances in Neural Information …, 2024 - proceedings.neurips.cc
We analyze the dynamics of finite width effects in wide but finite feature learning neural
networks. Starting from a dynamical mean field theory description of infinite width deep …

How Sharpness-Aware Minimization Minimizes Sharpness?

K Wen, T Ma, Z Li - The Eleventh International Conference on …, 2023 - openreview.net
Sharpness-Aware Minimization (SAM) is a highly effective regularization technique for
improving the generalization of deep neural networks in various settings. However, the …
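For context on the SAM entries in this list: the algorithm is the two-step update of Foret et al., an ascent step of radius $\rho$ along the normalized gradient followed by a descent step using the gradient evaluated at that perturbed point. Below is a minimal NumPy sketch of this standard rule; the toy quadratic loss, step sizes, and variable names are illustrative choices, not code from the cited papers.

```python
# Standard SAM update: ascent to w + rho * g/||g||, then a descent step
# using the gradient taken at the perturbed point.
import numpy as np

def loss_and_grad(w, A, b):
    # toy quadratic loss 0.5 * ||A w - b||^2 and its analytic gradient
    r = A @ w - b
    return 0.5 * r @ r, A.T @ r

def sam_step(w, A, b, lr=0.01, rho=0.05):
    _, g = loss_and_grad(w, A, b)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # normalized ascent perturbation
    _, g_adv = loss_and_grad(w + eps, A, b)      # gradient at the perturbed point
    return w - lr * g_adv                        # descent with the SAM gradient

rng = np.random.default_rng(0)
A, b = rng.normal(size=(20, 5)), rng.normal(size=20)
w = np.zeros(5)
for _ in range(500):
    w = sam_step(w, A, b)
print("final loss:", loss_and_grad(w, A, b)[0])
```

The $\rho$-sized detour is what couples each step to the local curvature of the loss, which is the quantity the dynamics-focused SAM papers in this list track.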

The dynamics of sharpness-aware minimization: Bouncing across ravines and drifting towards wide minima

PL Bartlett, PM Long, O Bousquet - Journal of Machine Learning Research, 2023 - jmlr.org
We consider Sharpness-Aware Minimization (SAM), a gradient-based optimization method
for deep networks that has exhibited performance improvements on image and language …

Understanding multi-phase optimization dynamics and rich nonlinear behaviors of ReLU networks

M Wang, C Ma - Advances in Neural Information Processing …, 2024 - proceedings.neurips.cc
The training process of ReLU neural networks often exhibits complicated nonlinear
phenomena. The nonlinearity of models and non-convexity of loss pose significant …

Implicit Bias of AdamW: $\ell_\infty$-Norm Constrained Optimization

S **e, Z Li - International Conference on Machine Learning, 2024 - proceedings.mlr.press
Adam with decoupled weight decay, also known as AdamW, is widely acclaimed for its
superior performance in language modeling tasks, surpassing Adam with $\ell_2 …
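The distinction behind this entry is that AdamW applies weight decay directly to the parameters, whereas Adam with an $\ell_2$ penalty adds $\lambda w$ to the gradient and therefore rescales it by the adaptive denominator. A minimal sketch of the two standard update rules, written in plain NumPy for illustration and not taken from the cited paper:

```python
# AdamW (decoupled weight decay) vs. Adam with an l2 penalty: standard updates.
import numpy as np

def adamw_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat, v_hat = m / (1 - b1 ** t), v / (1 - b2 ** t)
    # weight decay acts on w directly and is NOT rescaled by sqrt(v_hat)
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * w)
    return w, m, v

def adam_l2_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, lam=0.01):
    g = g + lam * w                    # the l2 term enters the gradient, so the
    m = b1 * m + (1 - b1) * g          # adaptive denominator rescales it below
    v = b2 * v + (1 - b2) * g * g
    m_hat, v_hat = m / (1 - b1 ** t), v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# tiny usage on a toy loss ||w||^2 whose gradient is 2w
w, m, v = np.ones(3), np.zeros(3), np.zeros(3)
for t in range(1, 101):
    w, m, v = adamw_step(w, 2 * w, m, v, t)
```

Keeping the decay term outside the adaptive rescaling is the design choice whose implicit bias the paper analyzes.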

SAM operates far from home: eigenvalue regularization as a dynamical phenomenon

A Agarwala, Y Dauphin - International Conference on …, 2023 - proceedings.mlr.press
The Sharpness-Aware Minimization (SAM) optimization algorithm has been shown
to control large eigenvalues of the loss Hessian and provide generalization benefits in a …