On the implicit bias in deep-learning algorithms

G Vardi - Communications of the ACM, 2023 - dl.acm.org
Deep learning has been highly successful in recent years and has led to dramatic
improvements in multiple domains …

Understanding gradient descent on the edge of stability in deep learning

S Arora, Z Li, A Panigrahi - International Conference on …, 2022 - proceedings.mlr.press
Deep learning experiments by Cohen et al. (2021) using deterministic Gradient
Descent (GD) revealed an Edge of Stability (EoS) phase when learning rate (LR) and …
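
A minimal numpy sketch of what "sharpness" means in these EoS results, using a toy two-parameter loss of my own rather than anything from the paper: it runs full-batch GD on L(a, b) = 0.5*(a*b - 1)^2 and prints the largest Hessian eigenvalue next to the stability threshold 2/lr. The loss, step size, and initialization are arbitrary illustrative choices.

    # Toy illustration (not the paper's setup): track the sharpness, i.e. the
    # largest Hessian eigenvalue, along a full-batch GD trajectory and compare
    # it with the threshold 2/lr that the EoS literature refers to.
    import numpy as np

    def loss(w):
        a, b = w
        return 0.5 * (a * b - 1.0) ** 2

    def grad(w):
        a, b = w
        r = a * b - 1.0
        return np.array([b * r, a * r])

    def hessian(w):
        a, b = w
        return np.array([[b * b, 2 * a * b - 1.0],
                         [2 * a * b - 1.0, a * a]])

    lr = 0.2                     # deliberately coarse step size
    w = np.array([2.5, 0.1])     # arbitrary initialization
    for step in range(51):
        sharpness = np.linalg.eigvalsh(hessian(w))[-1]   # top eigenvalue
        if step % 10 == 0:
            print(f"step {step:2d}  loss {loss(w):.5f}  "
                  f"sharpness {sharpness:.3f}  2/lr {2 / lr:.1f}")
        w = w - lr * grad(w)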

Understanding the generalization benefit of normalization layers: Sharpness reduction

K Lyu, Z Li, S Arora - Advances in Neural Information …, 2022 - proceedings.neurips.cc
Normalization layers (e.g., Batch Normalization, Layer Normalization) were
introduced to help with optimization difficulties in very deep nets, but they clearly also help …
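
One concrete property behind the sharpness-reduction view (a standard observation, not code from the paper) is that a normalization layer makes the output invariant to positive rescaling of the preceding weights. A minimal numpy sketch, with the affine (gamma, beta) parameters omitted for brevity:

    # Toy check (not from the paper): layer normalization of a linear layer's
    # output is unchanged when the weight matrix is multiplied by a positive
    # constant, up to the small epsilon in the denominator.
    import numpy as np

    def layer_norm(z, eps=1e-6):
        return (z - z.mean()) / (z.std() + eps)

    rng = np.random.default_rng(0)
    W = rng.normal(size=(8, 4))
    x = rng.normal(size=4)

    out = layer_norm(W @ x)
    out_scaled = layer_norm((10.0 * W) @ x)   # rescale the weights
    print(np.allclose(out, out_scaled, atol=1e-4))   # True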

Self-stabilization: The implicit bias of gradient descent at the edge of stability

A Damian, E Nichani, JD Lee - arXiv preprint arXiv:2209.15594, 2022 - arxiv.org
Traditional analyses of gradient descent show that when the largest eigenvalue of the
Hessian, also known as the sharpness $S(\theta)$, is bounded by $2/\eta$, training is …
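
The bound quoted here is the classical descent condition, and it is easiest to see on a one-dimensional quadratic, where the sharpness is just the curvature. A minimal sketch (my own illustration, not the paper's analysis): GD with step size eta contracts exactly when the curvature stays below 2/eta.

    # On f(w) = 0.5 * lam * w^2 the sharpness S is exactly lam, and the GD update
    # multiplies w by (1 - eta * lam), which contracts iff lam < 2/eta.
    import numpy as np

    def run_gd(lam, eta, w0=1.0, steps=30):
        w = w0
        for _ in range(steps):
            w = w - eta * lam * w      # gradient of 0.5*lam*w^2 is lam*w
        return abs(w)

    eta = 0.1                          # so the threshold 2/eta is 20
    for lam in (15.0, 19.0, 21.0, 25.0):
        print(f"sharpness {lam:4.1f}  |w_T| = {run_gd(lam, eta):.2e}")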

Learning threshold neurons via edge of stability

K Ahn, S Bubeck, S Chewi, YT Lee… - Advances in Neural …, 2023 - proceedings.neurips.cc
Existing analyses of neural network training often operate under the unrealistic assumption
of an extremely small learning rate. This lies in stark contrast to practical wisdom and …

Implicit bias of the step size in linear diagonal neural networks

MS Nacson, K Ravichandran… - International …, 2022 - proceedings.mlr.press
Focusing on diagonal linear networks as a model for understanding the implicit bias in
underdetermined models, we show how the gradient descent step size can have a large …
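
For concreteness, a diagonal linear network replaces the linear predictor w with an elementwise product u * v and runs GD on (u, v). The sketch below is a toy instance of my own (data, initialization scale, and step size are arbitrary), not the paper's experiment.

    # Toy diagonal linear network: predictor w = u * v (elementwise), trained by
    # GD on (u, v) for an underdetermined least-squares problem with a sparse target.
    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 10, 30                          # fewer samples than features
    X = rng.normal(size=(n, d))
    w_star = np.zeros(d)
    w_star[:3] = 1.0                       # sparse ground truth
    y = X @ w_star

    alpha, eta = 0.1, 0.01                 # initialization scale and step size
    u = np.full(d, alpha)
    v = np.full(d, alpha)

    for _ in range(20000):
        g = X.T @ (X @ (u * v) - y) / n    # gradient of 0.5/n * ||Xw - y||^2 in w
        u, v = u - eta * g * v, v - eta * g * u   # chain rule through w = u * v

    w = u * v
    print("train loss:", 0.5 * np.mean((X @ w - y) ** 2))
    print("largest |w| coordinates:", np.argsort(-np.abs(w))[:3])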

Large Stepsize Gradient Descent for Logistic Loss: Non-Monotonicity of the Loss Improves Optimization Efficiency

J Wu, PL Bartlett, M Telgarsky… - The Thirty Seventh …, 2024 - proceedings.mlr.press
We consider gradient descent (GD) with a constant stepsize applied to logistic
regression with linearly separable data, where the constant stepsize $\eta$ is so large that …
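
A minimal sketch of this setting with a toy dataset of my own: four linearly separable points and a deliberately large constant stepsize. In this particular run the logistic loss jumps up on the first step before decreasing, i.e. the trajectory is not monotone.

    # Constant-stepsize GD on the logistic loss with separable toy data; the
    # stepsize is large enough that the first step overshoots and the loss
    # temporarily increases.
    import numpy as np

    X = np.array([[ 1.0,  10.0],
                  [ 1.0,  -1.0],
                  [-1.0, -10.0],
                  [-1.0,   1.0]])
    y = np.array([1.0, 1.0, -1.0, -1.0])

    def logistic_loss(w):
        return np.mean(np.log1p(np.exp(-y * (X @ w))))

    def grad(w):
        s = -y / (1.0 + np.exp(y * (X @ w)))   # d/dm log(1 + exp(-m)) = -1/(1+e^m)
        return X.T @ s / len(y)

    eta = 2.0                                  # large constant stepsize
    w = np.zeros(2)
    for t in range(8):
        print(f"t={t}  loss={logistic_loss(w):.3f}")
        w = w - eta * grad(w)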

Two sides of one coin: the limits of untuned SGD and the power of adaptive methods

J Yang, X Li, I Fatkhullin, N He - Advances in Neural …, 2023 - proceedings.neurips.cc
The classical analysis of Stochastic Gradient Descent (SGD) with polynomially decaying
stepsize $\eta_t = \eta/\sqrt{t}$ relies on a well-tuned $\eta$ depending on problem …
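
The schedule in question is eta_t = eta / sqrt(t) with a fixed base stepsize eta. A minimal sketch applying it as plain SGD to a toy least-squares problem (the problem, eta, and sampling scheme are arbitrary choices of mine; only the schedule matters):

    # Plain SGD with the polynomially decaying stepsize eta_t = eta / sqrt(t)
    # on a toy least-squares problem.
    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 200, 5
    X = rng.normal(size=(n, d))
    w_star = rng.normal(size=d)
    y = X @ w_star + 0.1 * rng.normal(size=n)

    eta = 0.5                                  # the untuned base stepsize
    w = np.zeros(d)
    for t in range(1, 5001):
        i = rng.integers(n)                    # sample one example
        g = (X[i] @ w - y[i]) * X[i]           # gradient of 0.5*(x_i.w - y_i)^2
        w = w - (eta / np.sqrt(t)) * g         # eta_t = eta / sqrt(t)

    print("distance to w_star:", np.linalg.norm(w - w_star))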

Adaptive gradient methods at the edge of stability

JM Cohen, B Ghorbani, S Krishnan, N Agarwal… - arXiv preprint arXiv …, 2022 - arxiv.org
Very little is known about the training dynamics of adaptive gradient methods like Adam in
deep learning. In this paper, we shed light on the behavior of these algorithms in the full …

Implicit bias of gradient descent for logistic regression at the edge of stability

J Wu, V Braverman, JD Lee - Advances in Neural …, 2023 - proceedings.neurips.cc
Recent research has observed that in machine learning optimization, gradient descent (GD)
often operates at the edge of stability (EoS) [Cohen et al., 2021], where the stepsizes are set …
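
The implicit-bias statement in this setting is directional: for separable data, GD on the logistic loss is known to converge in direction to the L2 max-margin separator. A minimal sketch on a symmetric toy dataset of my own, where that direction is (1, 1)/sqrt(2), tracking the cosine between the normalized iterate and the max-margin direction:

    # GD on the logistic loss with separable toy data; the normalized iterate
    # drifts toward the max-margin direction (1, 1)/sqrt(2) for this dataset.
    import numpy as np

    X = np.array([[1.0, 2.0], [2.0, 1.0], [8.0, 1.0],
                  [-1.0, -2.0], [-2.0, -1.0], [-8.0, -1.0]])
    y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
    w_mm = np.array([1.0, 1.0]) / np.sqrt(2.0)   # max-margin direction here

    def grad(w):
        s = -y / (1.0 + np.exp(y * (X @ w)))
        return X.T @ s / len(y)

    eta = 1.0
    w = np.zeros(2)
    for t in range(1, 100001):
        w = w - eta * grad(w)
        if t in (1, 10, 100, 1000, 10000, 100000):
            cos = (w / np.linalg.norm(w)) @ w_mm
            print(f"t={t:6d}  cosine with max-margin direction = {cos:.4f}")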