High-dimensional limit theorems for SGD: Effective dynamics and critical scaling
We study the scaling limits of stochastic gradient descent (SGD) with constant step-size in
the high-dimensional regime. We prove limit theorems for the trajectories of summary …
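For orientation, a minimal numpy sketch of the kind of object such limit theorems concern: constant step-size online SGD in high dimension, with a low-dimensional summary statistic (here the overlap with a planted signal w_star) tracked along the trajectory. The task and the statistic are illustrative assumptions, not the paper's construction.
```python
# Illustration only: constant step-size online SGD on high-dimensional least
# squares with a planted signal, tracking a one-dimensional summary statistic.
import numpy as np

rng = np.random.default_rng(0)
d = 2000                       # ambient dimension (illustrative)
eta = 0.5 / d                  # constant step size, scaled with dimension
w_star = rng.standard_normal(d) / np.sqrt(d)   # planted signal, roughly unit norm
w = np.zeros(d)                # SGD iterate

overlaps = []
for t in range(20000):
    x = rng.standard_normal(d)                 # fresh sample (online / one-pass)
    y = x @ w_star + 0.1 * rng.standard_normal()
    grad = (w @ x - y) * x                     # gradient of 0.5 * (x.w - y)**2
    w -= eta * grad                            # constant step-size SGD update
    if t % 2000 == 0:
        overlaps.append(w @ w_star / (w_star @ w_star))   # summary statistic

print("overlap along the trajectory:", np.round(overlaps, 3))
```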
(S)GD over Diagonal Linear Networks: Implicit Bias, Large Stepsizes and Edge of Stability
In this paper, we investigate the impact of stochasticity and large stepsizes on the implicit
regularisation of gradient descent (GD) and stochastic gradient descent (SGD) over $2 …
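A minimal sketch of the model class named in the title, under the common u*v parameterisation of a 2-layer diagonal linear network; the sparse regression task, initialisation scale alpha, and step size below are illustrative assumptions rather than the paper's exact setting.
```python
# Gradient descent on a 2-layer diagonal linear network: the predictor is
# beta = u * v (elementwise), trained on a toy sparse regression problem.
import numpy as np

rng = np.random.default_rng(1)
n, d = 40, 100
beta_star = np.zeros(d); beta_star[:3] = 1.0     # sparse, nonnegative ground truth
X = rng.standard_normal((n, d))
y = X @ beta_star

alpha = 0.1                                      # small initialisation scale
u = alpha * np.ones(d)                           # u = v at init keeps the sketch
v = alpha * np.ones(d)                           # simple (beta = u**2 >= 0 here)
eta = 0.02                                       # step size

for _ in range(20000):
    r = X @ (u * v) - y                          # residuals of the effective predictor
    g = X.T @ r / n                              # gradient w.r.t. beta = u * v
    u, v = u - eta * g * v, v - eta * g * u      # chain rule through the product

print("recovered beta (first 5 coords):", np.round(u * v, 3)[:5])
```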
A modern look at the relationship between sharpness and generalization
Sharpness of minima is a promising quantity that can correlate with generalization in deep
networks and, when optimized during training, can improve generalization. However …
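Sharpness in this line of work is typically measured through the largest eigenvalue of the loss Hessian. A small sketch of one standard way to estimate it, power iteration on Hessian-vector products; the logistic-regression objective and the finite-difference HVP are stand-in assumptions, not the paper's protocol.
```python
# Estimate "sharpness" (top Hessian eigenvalue) by power iteration on
# finite-difference Hessian-vector products of a toy logistic-regression loss.
import numpy as np

rng = np.random.default_rng(2)
n, d = 200, 20
X = rng.standard_normal((n, d))
y = np.sign(X @ rng.standard_normal(d))
w = 0.1 * rng.standard_normal(d)

def grad(w):
    margins = y * (X @ w)
    s = 1.0 / (1.0 + np.exp(margins))            # sigmoid(-margin)
    return -(X * (y * s)[:, None]).mean(axis=0)  # gradient of mean logistic loss

def hvp(w, v, r=1e-4):
    # central finite difference of the gradient along v approximates H @ v
    return (grad(w + r * v) - grad(w - r * v)) / (2 * r)

v = rng.standard_normal(d)
v /= np.linalg.norm(v)
for _ in range(100):                             # power iteration
    hv = hvp(w, v)
    v = hv / np.linalg.norm(hv)
lam_max = float(v @ hvp(w, v))                   # Rayleigh quotient ~ sharpness
print("estimated top Hessian eigenvalue:", round(lam_max, 4))
```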
Learning threshold neurons via edge of stability
Existing analyses of neural network training often operate under the unrealistic assumption
of an extremely small learning rate. This lies in stark contrast to practical wisdom and …
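The threshold behind the "edge of stability" terminology is the classical quadratic stability condition eta < 2/lambda, which the "extremely small learning rate" assumption sidesteps. A tiny check of both regimes:
```python
# On a quadratic loss L(w) = 0.5 * lam * w**2, gradient descent contracts
# iff the step size eta < 2 / lam.  Check step sizes on both sides of 2/lam.
lam = 10.0
for eta in (0.15, 0.19, 0.21):        # 2 / lam = 0.2 is the critical step size
    w = 1.0
    for _ in range(50):
        w -= eta * lam * w            # GD update: w <- (1 - eta * lam) * w
    print(f"eta={eta}: |w| after 50 steps = {abs(w):.3g}")
```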
Dynamics of finite width kernel and prediction fluctuations in mean field neural networks
We analyze the dynamics of finite width effects in wide but finite feature learning neural
networks. Starting from a dynamical mean field theory description of infinite width deep …
How Sharpness-Aware Minimization Minimizes Sharpness?
Sharpness-Aware Minimization (SAM) is a highly effective regularization technique for
improving the generalization of deep neural networks for various settings. However, the …
The dynamics of sharpness-aware minimization: Bouncing across ravines and drifting towards wide minima
We consider Sharpness-Aware Minimization (SAM), a gradient-based optimization method
for deep networks that has exhibited performance improvements on image and language …
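Both SAM entries above consider the same underlying update: take an ascent step of radius rho along the normalized gradient, then descend using the gradient evaluated at that perturbed point. A sketch on a toy quadratic (the objective and hyperparameters are assumptions); note that with a fixed rho the iterates hover near, rather than at, the minimum, the "bouncing" behaviour the title above alludes to.
```python
# The SAM update: perturb the weights by rho * g / ||g||, then descend with
# the gradient taken at the perturbed point.  Toy quadratic objective only.
import numpy as np

rng = np.random.default_rng(3)
d = 10
A = rng.standard_normal((d, d))
H = A @ A.T / d + 0.1 * np.eye(d)                 # PSD "Hessian" of a toy loss
loss = lambda w: 0.5 * w @ H @ w
grad = lambda w: H @ w

w = rng.standard_normal(d)
eta, rho = 0.05, 0.1                              # step size and SAM radius
for _ in range(500):
    g = grad(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)   # ascent to a nearby point
    w = w - eta * grad(w + eps)                   # descend with perturbed gradient
print("final loss (hovers near, not at, the minimum):", round(float(loss(w)), 6))
```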
Understanding multi-phase optimization dynamics and rich nonlinear behaviors of ReLU networks
The training process of ReLU neural networks often exhibits complicated nonlinear
phenomena. The nonlinearity of models and non-convexity of loss pose significant …
Implicit Bias of AdamW: $\ell_\infty$-Norm Constrained Optimization
S Xie, Z Li - International Conference on Machine Learning, 2024 - proceedings.mlr.press
Adam with decoupled weight decay, also known as AdamW, is widely acclaimed for its
superior performance in language modeling tasks, surpassing Adam with $\ell_2 …
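The distinction the abstract refers to: Adam with $\ell_2$ regularization folds the decay term lambda*w into the gradient before the moment estimates, whereas AdamW applies the decay directly to the weights ("decoupled"). A single-vector sketch of the AdamW step on a toy quadratic (hyperparameters illustrative):
```python
# AdamW step with decoupled weight decay: the wd * w term is added to the
# update itself, not to the gradient that feeds the moment estimates.
import numpy as np

def adamw_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=1e-2):
    m = b1 * m + (1 - b1) * g                    # first-moment estimate
    v = b2 * v + (1 - b2) * g * g                # second-moment estimate
    m_hat = m / (1 - b1 ** t)                    # bias corrections
    v_hat = v / (1 - b2 ** t)
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * w)   # decoupled decay
    return w, m, v

# toy usage: minimize 0.5 * ||w - target||^2
rng = np.random.default_rng(4)
target = rng.standard_normal(50)
w = np.zeros(50)
m = np.zeros(50); v = np.zeros(50)
for t in range(1, 5001):
    g = w - target                               # gradient of the toy loss
    w, m, v = adamw_step(w, g, m, v, t)
print("distance to target:", round(float(np.linalg.norm(w - target)), 3))
```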
SAM operates far from home: eigenvalue regularization as a dynamical phenomenon
The Sharpness-Aware Minimization (SAM) optimization algorithm has been shown
to control large eigenvalues of the loss Hessian and provide generalization benefits in a …