Understanding gradient descent on the edge of stability in deep learning

S Arora, Z Li, A Panigrahi - International Conference on …, 2022 - proceedings.mlr.press
Deep learning experiments by Cohen et al. (2021) using deterministic Gradient
Descent (GD) revealed an Edge of Stability (EoS) phase when learning rate (LR) and …
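
For context, the stability threshold the EoS literature refers to is the classical one for gradient descent on a quadratic: the iteration diverges once the curvature (sharpness) exceeds 2/LR. A minimal sketch of that threshold on an illustrative 1-D quadratic (nothing here is taken from the paper's experiments):

    def gd_on_quadratic(sharpness, lr, steps=50, x0=1.0):
        # Gradient descent on f(x) = 0.5 * sharpness * x**2.
        # The update x <- (1 - lr * sharpness) * x contracts iff |1 - lr * sharpness| < 1,
        # i.e. iff sharpness < 2 / lr.
        x = x0
        for _ in range(steps):
            x -= lr * sharpness * x
        return x

    lr = 0.1
    print(gd_on_quadratic(sharpness=19.0, lr=lr))  # 19 < 2/lr = 20: shrinks toward 0
    print(gd_on_quadratic(sharpness=21.0, lr=lr))  # 21 > 2/lr = 20: blows up

In the EoS phase reported by Cohen et al., the top Hessian eigenvalue of the training loss rises to roughly this 2/LR value and then hovers there while the loss keeps decreasing non-monotonically.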

Towards theoretically understanding why sgd generalizes better than adam in deep learning

P Zhou, J Feng, C Ma, C Xiong… - Advances in Neural …, 2020 - proceedings.neurips.cc
It is not yet clear why Adam-like adaptive gradient algorithms suffer from worse
generalization performance than SGD despite their faster training speed. This work aims to …
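
For reference, the two update rules being compared, written out in their standard textbook form (this is not code from the paper):

    import numpy as np

    def sgd_step(theta, grad, lr=0.1):
        # Plain SGD: step against the stochastic gradient.
        return theta - lr * grad

    def adam_step(theta, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        # Adam: exponential moving averages of the gradient and its square,
        # with bias correction, yield a per-coordinate adaptive step size.
        m, v, t = state
        t += 1
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad ** 2
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        return theta - lr * m_hat / (np.sqrt(v_hat) + eps), (m, v, t)

The question the abstract poses is why the adaptive, per-coordinate scaling in the second rule, which speeds up training, tends to end in solutions that generalize worse than those found by the first.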

Don't use large mini-batches, use local SGD

T Lin, SU Stich, KK Patel, M Jaggi - arXiv preprint arXiv:1808.07217, 2018 - arxiv.org
Mini-batch stochastic gradient methods (SGD) are state of the art for distributed training of
deep neural networks. Drastic increases in the mini-batch sizes have led to key efficiency …
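
The local SGD of the title replaces synchronizing after every mini-batch with occasional model averaging: each worker runs several SGD steps on its own data, and only then are the parameters averaged. A schematic NumPy version (worker loss functions and constants are illustrative, not from the paper):

    import numpy as np

    def local_sgd(grad_fns, theta0, lr=0.05, rounds=100, local_steps=8):
        # grad_fns: one stochastic-gradient function per worker.
        # Each round, every worker takes `local_steps` independent SGD steps,
        # then the models are averaged (the only communication step).
        theta = np.array(theta0, dtype=float)
        for _ in range(rounds):
            local_models = []
            for grad in grad_fns:
                w = theta.copy()
                for _ in range(local_steps):
                    w -= lr * grad(w)
                local_models.append(w)
            theta = np.mean(local_models, axis=0)
        return theta

    # Toy usage: two workers with noisy gradients of f(w) = 0.5 * ||w||^2.
    rng = np.random.default_rng(0)
    workers = [lambda w: w + 0.1 * rng.standard_normal(w.shape) for _ in range(2)]
    print(local_sgd(workers, theta0=[1.0, -2.0]))

This communication pattern is the alternative the title recommends over simply growing the mini-batch size.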

On the origin of implicit regularization in stochastic gradient descent

SL Smith, B Dherin, DGT Barrett, S De - arXiv preprint arXiv:2101.12176, 2021 - arxiv.org
For infinitesimal learning rates, stochastic gradient descent (SGD) follows the path of
gradient flow on the full batch loss function. However, moderately large learning rates can …
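
The limiting statement in the first sentence is the gradient-flow ODE on the full-batch loss C(\omega),

    \dot\omega(t) = -\nabla C(\omega(t)) \qquad (\text{learning rate } \eta \to 0),

while the paper's subject is the leading correction at moderate \eta: backward error analysis yields a modified loss of the form C(\omega) + O(\eta) \cdot (\text{mean squared norm of the mini-batch gradients}), i.e. an implicit penalty on steep directions; the explicit full-batch constant is recalled under the "Implicit gradient regularization" entry below.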

Stochastic gradient descent as approximate bayesian inference

S Mandt, MD Hoffman, DM Blei - Journal of Machine Learning …, 2017 - jmlr.org
Stochastic Gradient Descent with a constant learning rate (constant SGD) simulates a
Markov chain with a stationary distribution. With this perspective, we derive several new …
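
The Markov-chain picture is easy to reproduce on a toy quadratic: with a constant learning rate and noisy gradients the iterates never settle at the minimum but fluctuate around it with a stationary variance set by the learning rate and the noise level. The model and constants below are illustrative, not the paper's:

    import numpy as np

    # Constant-LR SGD on f(x) = 0.5 * s * x**2 with additive gradient noise of std sigma:
    #   x <- x - lr * (s * x + noise)
    # This is an AR(1) chain whose stationary variance works out to
    #   lr * sigma**2 / (s * (2 - lr * s)),
    # so smaller steps sample a tighter distribution around the minimum.
    rng = np.random.default_rng(0)
    s, sigma, lr = 1.0, 1.0, 0.1
    x, samples = 0.0, []
    for t in range(200_000):
        x -= lr * (s * x + sigma * rng.standard_normal())
        if t > 10_000:          # discard burn-in
            samples.append(x)
    print("empirical variance:", np.var(samples))
    print("predicted variance:", lr * sigma ** 2 / (s * (2 - lr * s)))

It is this Gaussian-like stationary distribution that the paper reinterprets as an approximate posterior, turning the learning rate and mini-batch size into knobs of an inference procedure.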

Three factors influencing minima in SGD

S Jastrzębski, Z Kenton, D Arpit, N Ballas… - arXiv preprint arXiv …, 2017 - arxiv.org
We investigate the dynamical and convergent properties of stochastic gradient descent
(SGD) applied to Deep Neural Networks (DNNs). Characterizing the relation between …
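
The three factors in question are the learning rate \eta, the batch size B, and the covariance of the gradient noise; the heuristic continuous-time model commonly used to relate them (a modelling approximation, not an exact theorem) is

    \theta_{k+1} = \theta_k - \eta\, g_B(\theta_k), \qquad \mathbb{E}[g_B] = \nabla L(\theta), \qquad \mathrm{Cov}[g_B] \approx \tfrac{1}{B}\, C(\theta),

    d\theta \;\approx\; -\nabla L(\theta)\, dt \;+\; \sqrt{\tfrac{\eta}{B}}\; C(\theta)^{1/2}\, dW_t,

so at this level of approximation the stochasticity enters only through the ratio \eta / B, which is why that ratio is tied to the width of the minima SGD ends up in.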

Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks

P Chaudhari, S Soatto - 2018 Information Theory and …, 2018 - ieeexplore.ieee.org
Stochastic gradient descent (SGD) is widely believed to perform implicit regularization when
used to train deep neural networks, but the precise manner in which this occurs has thus far …

Implicit gradient regularization

DGT Barrett, B Dherin - arXiv preprint arXiv:2009.11162, 2020 - arxiv.org
Gradient descent can be surprisingly good at optimizing deep neural networks without
overfitting and without explicit regularization. We find that the discrete steps of gradient …
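
The calculation behind the abstract can be compressed into one formula from the implicit-gradient-regularization analysis (the \eta/4 constant is the one usually quoted; treat this as a paraphrase rather than the paper's exact statement): a gradient-descent step with learning rate \eta follows, up to O(\eta^3) per step, the gradient flow of the modified loss

    \tilde L(\theta) \;=\; L(\theta) \;+\; \frac{\eta}{4}\, \big\| \nabla L(\theta) \big\|^2,

so the discrete steps implicitly penalize sharp, large-gradient regions even though no explicit regularizer is ever added.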

Understanding the acceleration phenomenon via high-resolution differential equations

B Shi, SS Du, MI Jordan, WJ Su - Mathematical Programming, 2022 - Springer
Gradient-based optimization algorithms can be studied from the perspective of limiting
ordinary differential equations (ODEs). Motivated by the fact that existing ODEs do not …
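
For orientation: the low-resolution ODE long associated with Nesterov's accelerated method for convex f (step size s \to 0) is

    \ddot X(t) + \frac{3}{t}\, \dot X(t) + \nabla f(X(t)) = 0,

and a known shortcoming of such low-resolution limits is that Nesterov's method and Polyak's heavy-ball method can share the same limiting ODE. The high-resolution ODEs studied here keep O(\sqrt{s}) terms, schematically

    \ddot X + \frac{3}{t}\, \dot X + \sqrt{s}\, \nabla^2 f(X)\, \dot X + \Big(1 + \frac{3\sqrt{s}}{2t}\Big) \nabla f(X) = 0,

where the coefficients shown are indicative; the structurally important piece is the Hessian-driven damping term \sqrt{s}\, \nabla^2 f(X)\, \dot X, which heavy ball lacks and which is what separates the two algorithms at this resolution.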

What Happens after SGD Reaches Zero Loss? A Mathematical Framework

Z Li, T Wang, S Arora - arXiv preprint arXiv:2110.06914, 2021 - arxiv.org
Understanding the implicit bias of Stochastic Gradient Descent (SGD) is one of the key
challenges in deep learning, especially for overparametrized models, where the local …
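
A sketch of the framework's setting, summarized from this line of work rather than quoted from the paper: for an overparametrized model the global minimizers form a manifold \Gamma = \{\theta : L(\theta) = 0\} rather than isolated points, and once SGD reaches a neighborhood of \Gamma the remaining motion is a much slower drift along \Gamma driven by the gradient noise. In the label-noise instance usually used to illustrate the framework, that limiting drift is a projected gradient flow on the sharpness,

    \dot\theta \;=\; -\, P_{\Gamma}(\theta)\, \nabla\, \mathrm{tr}\big( \nabla^2 L(\theta) \big), \qquad \theta \in \Gamma,

with P_\Gamma the projection onto the tangent space of \Gamma, so the implicit bias is a drift toward flatter points of the zero-loss manifold.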