Understanding gradient descent on the edge of stability in deep learning
Deep learning experiments by \citet{cohen2021gradient} using deterministic Gradient
Descent (GD) revealed an Edge of Stability (EoS) phase when learning rate (LR) and …
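For intuition about the stability threshold involved here, a minimal sketch (an assumed toy setup, not the paper's experiments): gradient descent on a one-dimensional quadratic with curvature ("sharpness") lambda contracts only while lr * lambda < 2, and at the edge of stability this product hovers near 2.

    # Toy illustration: GD on f(theta) = 0.5 * sharpness * theta**2 with learning rate lr
    # contracts when lr * sharpness < 2 and diverges when lr * sharpness > 2.
    def gd_final_iterate(sharpness, lr, theta0=1.0, steps=50):
        theta = theta0
        for _ in range(steps):
            theta -= lr * sharpness * theta  # gradient of the quadratic
        return theta

    lr = 0.1
    print(gd_final_iterate(sharpness=15.0, lr=lr))  # 15 < 2/lr = 20: shrinks toward 0
    print(gd_final_iterate(sharpness=25.0, lr=lr))  # 25 > 20: oscillates and blows up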
Towards theoretically understanding why SGD generalizes better than Adam in deep learning
It is not yet clear why Adam-like adaptive gradient algorithms suffer from worse
generalization performance than SGD despite their faster training speed. This work aims to …
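For concreteness, the two update rules being contrasted are, in their standard forms (a reminder for context, not a statement of this paper's analysis):
\[
\mathrm{SGD:}\;\; \theta_{t+1} = \theta_t - \eta\, g_t,
\qquad
\mathrm{Adam:}\;\;
m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t,\;\;
v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2,\;\;
\theta_{t+1} = \theta_t - \eta\, \hat m_t \big/ \bigl(\sqrt{\hat v_t} + \epsilon\bigr),
\]
where $g_t$ is the mini-batch gradient and $\hat m_t$, $\hat v_t$ are the bias-corrected moment estimates; the per-coordinate rescaling by $\sqrt{\hat v_t}$ is the structural difference whose effect on generalization is at issue.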
Don't use large mini-batches, use local SGD
Mini-batch stochastic gradient methods (SGD) are state of the art for distributed training of
deep neural networks. Drastic increases in the mini-batch sizes have led to key efficiency …
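As a rough sketch of the algorithm the title refers to (a hypothetical toy setup, not the authors' code): each of K workers runs H local SGD steps on its own mini-batches, and the models are averaged once per round, so communication happens every H steps rather than every step.

    import numpy as np

    rng = np.random.default_rng(0)
    d, K, H, lr, rounds = 5, 4, 8, 0.05, 20
    w_star = rng.normal(size=d)               # ground-truth linear model (synthetic)

    def minibatch_grad(w, batch_size=16):
        # least-squares gradient on a freshly sampled synthetic mini-batch
        X = rng.normal(size=(batch_size, d))
        y = X @ w_star + 0.1 * rng.normal(size=batch_size)
        return X.T @ (X @ w - y) / batch_size

    w = np.zeros(d)                           # shared starting point
    for _ in range(rounds):
        local_models = []
        for _ in range(K):                    # each worker starts from the shared model
            w_k = w.copy()
            for _ in range(H):                # H local SGD steps, no communication
                w_k -= lr * minibatch_grad(w_k)
            local_models.append(w_k)
        w = np.mean(local_models, axis=0)     # periodic model averaging

    print("distance to w_star:", np.linalg.norm(w - w_star))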
On the origin of implicit regularization in stochastic gradient descent
For infinitesimal learning rates, stochastic gradient descent (SGD) follows the path of
gradient flow on the full-batch loss function. However, moderately large learning rates can …
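To make the opening claim concrete (standard facts, not this paper's contribution): full-batch gradient descent with learning rate $\eta$ is a forward-Euler step of gradient flow, and SGD replaces the gradient with a mini-batch estimate,
\[
\dot\theta(t) = -\nabla L\bigl(\theta(t)\bigr),
\qquad
\theta_{k+1} = \theta_k - \eta\, \nabla L_{B_k}(\theta_k),
\]
so the iterates track the flow as $\eta \to 0$, while at moderate learning rates the discretization error and the mini-batch noise introduce corrections of order $\eta$, which is the regime this paper analyzes.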
Stochastic gradient descent as approximate Bayesian inference
Stochastic Gradient Descent with a constant learning rate (constant SGD) simulates a
Markov chain with a stationary distribution. With this perspective, we derive several new …
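As a toy calculation illustrating the stationary-distribution viewpoint (an illustration, not the paper's derivation): for a one-dimensional quadratic loss with curvature $a$ and Gaussian gradient noise of variance $\sigma^2$, constant-LR SGD reads $\theta_{t+1} = (1-\eta a)\,\theta_t + \eta\,\varepsilon_t$, so the iterate variance obeys $V_{t+1} = (1-\eta a)^2 V_t + \eta^2\sigma^2$ and, for $\eta a < 2$, converges to the stationary value
\[
V_\infty \;=\; \frac{\eta^2\sigma^2}{1-(1-\eta a)^2}
\;=\; \frac{\eta\,\sigma^2}{a\,(2-\eta a)}
\;\approx\; \frac{\eta\,\sigma^2}{2a},
\]
a Gaussian stationary distribution whose spread is set by the learning rate and the noise level.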
Three factors influencing minima in SGD
We investigate the dynamics and convergence properties of stochastic gradient descent
(SGD) applied to Deep Neural Networks (DNNs). Characterizing the relation between …
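For context, the continuous-time heuristic commonly used in such analyses (a standard modeling assumption, not a verbatim statement of this paper's result): writing the mini-batch gradient as $\nabla L(\theta) + \xi$ with noise covariance $C(\theta)/B$ for batch size $B$, the SGD recursion is approximated by the stochastic differential equation
\[
d\theta \;=\; -\nabla L(\theta)\,dt \;+\; \sqrt{\tfrac{\eta}{B}}\; C(\theta)^{1/2}\, dW_t,
\]
so the learning rate and the batch size enter the noise term only through the ratio $\eta/B$, alongside the gradient-noise covariance $C(\theta)$.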
Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks
Stochastic gradient descent (SGD) is widely believed to perform implicit regularization when
used to train deep neural networks, but the precise manner in which this occurs has thus far …
Implicit gradient regularization
Gradient descent can be surprisingly good at optimizing deep neural networks without
overfitting and without explicit regularization. We find that the discrete steps of gradient …
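The central object in this line of work is a modified loss obtained by backward error analysis; for gradient descent with step size $\eta$, the first-order correction reported there is
\[
\widetilde L(\theta) \;=\; L(\theta) \;+\; \frac{\eta}{4}\,\bigl\|\nabla L(\theta)\bigr\|^2,
\]
meaning the discrete iterates follow the gradient flow of $\widetilde L$ more closely than that of $L$, which implicitly penalizes regions of parameter space with large gradient norm.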
Understanding the acceleration phenomenon via high-resolution differential equations
Gradient-based optimization algorithms can be studied from the perspective of limiting
ordinary differential equations (ODEs). Motivated by the fact that existing ODEs do not …
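The prototypical example of such a limiting ODE, stated here for context, is the low-resolution model of Nesterov's accelerated method for convex $f$,
\[
\ddot X(t) + \frac{3}{t}\,\dot X(t) + \nabla f\bigl(X(t)\bigr) = 0,
\]
and high-resolution analyses refine such models by retaining terms of order $\sqrt{s}$ in the step size $s$, such as a Hessian-driven damping term of the form $\sqrt{s}\,\nabla^2 f(X)\dot X$, which separates algorithms whose low-resolution limits coincide.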
What Happens after SGD Reaches Zero Loss?--A Mathematical Framework
Understanding the implicit bias of Stochastic Gradient Descent (SGD) is one of the key
challenges in deep learning, especially for overparametrized models, where the local …