Dive into Deep Learning
This open-source book represents our attempt to make deep learning approachable,
teaching readers the concepts, the context, and the code. The entire book is drafted in …
Stochastic gradient descent as approximate Bayesian inference
Stochastic Gradient Descent with a constant learning rate (constant SGD) simulates a
Markov chain with a stationary distribution. With this perspective, we derive several new …
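A minimal sketch, not taken from the paper, of the behaviour the abstract points to: SGD with a constant learning rate on a toy quadratic loss stops converging to a point and instead fluctuates around the minimum, so its long-run iterates can be read as draws from a stationary distribution. The loss, noise model, and hyperparameters below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy quadratic loss f(theta) = 0.5 * theta^2 with a noisy gradient oracle
# (true gradient plus Gaussian noise); purely illustrative.
def noisy_grad(theta, noise_std=1.0):
    return theta + noise_std * rng.standard_normal()

lr = 0.1                        # constant learning rate
theta = 5.0                     # start away from the minimum at 0
burn_in, n_steps = 1_000, 50_000
samples = []

for t in range(burn_in + n_steps):
    theta -= lr * noisy_grad(theta)
    if t >= burn_in:
        samples.append(theta)

# After burn-in the iterates no longer converge; they fluctuate around 0,
# and their spread reflects the stationary distribution induced by the
# constant step size and the gradient noise.
print("mean of iterates:", np.mean(samples), " std:", np.std(samples))
```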
A variational perspective on accelerated methods in optimization
Accelerated gradient methods play a central role in optimization, achieving optimal rates in
many settings. Although many generalizations and extensions of Nesterov's original …
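As background for the entry above, a minimal sketch of Nesterov's accelerated gradient method on a small quadratic; the objective, step size, and iteration count are illustrative assumptions rather than anything from the paper.

```python
import numpy as np

# Assumed smooth convex objective f(x) = 0.5 * x^T A x - b^T x.
A = np.array([[3.0, 0.5],
              [0.5, 1.0]])
b = np.array([1.0, -2.0])
grad = lambda x: A @ x - b

L = np.linalg.eigvalsh(A).max()   # Lipschitz constant of the gradient
s = 1.0 / L                       # step size

x = y = np.zeros(2)
for k in range(1, 200):
    x_next = y - s * grad(y)                        # gradient step at the look-ahead point
    y = x_next + (k - 1) / (k + 2) * (x_next - x)   # momentum extrapolation
    x = x_next

print("NAG iterate:  ", x)
print("direct solve: ", np.linalg.solve(A, b))
```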
Understanding the acceleration phenomenon via high-resolution differential equations
Gradient-based optimization algorithms can be studied from the perspective of limiting
ordinary differential equations (ODEs). Motivated by the fact that existing ODEs do not …
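For context, the kind of limiting ODE involved, recalled (hedged) from the acceleration-via-ODEs literature rather than quoted from this abstract; s denotes the step size.

```latex
% Low-resolution limiting ODE of Nesterov's method for smooth convex f
% (obtained as the step size s is sent to 0):
\[
  \ddot{X}(t) + \frac{3}{t}\,\dot{X}(t) + \nabla f\bigl(X(t)\bigr) = 0 .
\]
% The high-resolution refinement retains O(\sqrt{s}) terms, adding a
% gradient-correction (Hessian-driven damping) term:
\[
  \ddot{X}(t) + \frac{3}{t}\,\dot{X}(t)
    + \sqrt{s}\,\nabla^2 f\bigl(X(t)\bigr)\,\dot{X}(t)
    + \Bigl(1 + \tfrac{3\sqrt{s}}{2t}\Bigr)\nabla f\bigl(X(t)\bigr) = 0 .
\]
```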
Why momentum really works
G Goh - Distill, 2017 - distill.pub
Interactive demo defaults: step-size α = 0.02, momentum β = 0.99. We often think of Momentum as a means of dampening …
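A minimal sketch of the classical momentum (heavy-ball) update the article visualizes, using the demo's default step-size α = 0.02 and momentum β = 0.99; the quadratic objective is an assumed stand-in for the article's example.

```python
import numpy as np

# Assumed ill-conditioned quadratic standing in for the article's example.
A = np.diag([1.0, 25.0])
grad = lambda w: A @ w

alpha, beta = 0.02, 0.99    # step size and momentum from the snippet
w = np.array([1.0, 1.0])
z = np.zeros_like(w)        # "velocity": a dampened running sum of gradients

for _ in range(500):
    z = beta * z + grad(w)  # accumulate the gradient with damping beta
    w = w - alpha * z       # move against the accumulated direction

print("iterate after 500 momentum steps:", w)
```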
Noise is not the main factor behind the gap between SGD and Adam on transformers, but sign descent might be
The success of the Adam optimizer on a wide array of architectures has made it the default
in settings where stochastic gradient descent (SGD) performs poorly. However, our …
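A minimal sketch contrasting plain SGD with sign descent, the update rule named in the title, on an assumed badly-scaled noisy quadratic; the problem, noise level, and step sizes are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed badly-scaled quadratic with additive gradient noise.
scales = np.array([1.0, 100.0])
def noisy_grad(w):
    return scales * w + 0.1 * rng.standard_normal(2)

def run(update, lr, steps=2000):
    w = np.ones(2)
    for _ in range(steps):
        w = w - lr * update(noisy_grad(w))
    return w

sgd_w  = run(lambda g: g,          lr=0.005)  # plain SGD
sign_w = run(lambda g: np.sign(g), lr=0.005)  # sign descent: keep only the sign per coordinate

print("SGD final iterate:         ", sgd_w)
print("sign-descent final iterate:", sign_w)
```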
Nonparametric stochastic approximation with large step-sizes
A Dieuleveut, F Bach - 2016 - projecteuclid.org
We consider the random-design least-squares regression problem within the reproducing
kernel Hilbert space (RKHS) framework. Given a stream of independent and identically …
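A minimal sketch of stochastic approximation for least-squares regression in an RKHS, storing one coefficient per observed example; the Gaussian kernel, the constant (large) step size, and the toy data stream are assumptions, not the paper's exact setting.

```python
import numpy as np

rng = np.random.default_rng(0)

def k(x, y, bandwidth=0.5):
    """Gaussian (RBF) kernel, an assumed choice of RKHS."""
    return np.exp(-(x - y) ** 2 / (2 * bandwidth ** 2))

gamma = 0.5            # constant ("large") step size
xs, alphas = [], []    # estimate: f(x) = sum_i alphas[i] * k(xs[i], x)

def predict(x):
    return sum(a * k(xi, x) for a, xi in zip(alphas, xs))

# Stream of i.i.d. (x, y) pairs from an assumed regression model.
for _ in range(500):
    x = rng.uniform(-1.0, 1.0)
    y = np.sin(3.0 * x) + 0.1 * rng.standard_normal()
    residual = predict(x) - y
    # SGD step in the RKHS: f <- f - gamma * (f(x) - y) * k(x, .)
    xs.append(x)
    alphas.append(-gamma * residual)

print("prediction at 0.3:", predict(0.3), " target:", np.sin(0.9))
```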
Harder, better, faster, stronger convergence rates for least-squares regression
We consider the optimization of a quadratic objective function whose gradients are only
accessible through a stochastic oracle that returns the gradient at any given point plus a …
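A minimal sketch of the setting described: a quadratic objective minimized with gradients returned by a noisy oracle, with Polyak-Ruppert averaging of the iterates; the matrix, noise level, and step size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed quadratic objective f(x) = 0.5 * x^T H x - c^T x.
H = np.array([[2.0, 0.3],
              [0.3, 0.7]])
c = np.array([1.0, 0.5])
x_star = np.linalg.solve(H, c)

def stochastic_oracle(x, noise_std=0.5):
    """Returns the true gradient plus zero-mean Gaussian noise."""
    return H @ x - c + noise_std * rng.standard_normal(2)

gamma = 0.2                   # constant step size
x = np.zeros(2)
x_bar = np.zeros(2)           # running Polyak-Ruppert average of the iterates

for n in range(1, 20_001):
    x = x - gamma * stochastic_oracle(x)
    x_bar += (x - x_bar) / n  # online average

print("averaged iterate:", x_bar)
print("true optimum:    ", x_star)
```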
A variational analysis of stochastic gradient algorithms
Stochastic Gradient Descent (SGD) is an important algorithm in machine learning.
With constant learning rates, it is a stochastic process that, after an initial phase of …
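A hedged sketch of the continuous-time picture behind this line of analysis, recalled from the constant-step-size SGD literature rather than quoted from the abstract: near a minimum, the iterates are approximated by an Ornstein-Uhlenbeck process, with ε the learning rate, S the minibatch size, A the Hessian at the optimum, and BB^⊤ the gradient-noise covariance.

```latex
% Ornstein-Uhlenbeck approximation of constant-step-size SGD near a minimum:
\[
  d\theta(t) \;=\; -\,\epsilon A\,\theta(t)\,dt
                \;+\; \frac{\epsilon}{\sqrt{S}}\, B\, dW(t),
\]
% with Gaussian stationary covariance \Sigma solving the Lyapunov equation
\[
  A\Sigma + \Sigma A^{\top} \;=\; \frac{\epsilon}{S}\, B B^{\top} .
\]
```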
Dissipativity theory for Nesterov's accelerated method
In this paper, we adapt the control theoretic concept of dissipativity theory to provide a
natural understanding of Nesterov's accelerated method. Our theory ties rigorous …
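As a pointer to the control-theoretic machinery involved, a hedged statement of the generic discrete-time dissipation inequality (its textbook form, not the paper's specific storage function): a nonnegative storage function V and a supply rate S are required to satisfy

```latex
\[
  V(\xi_{k+1}) - V(\xi_k) \;\le\; S(\xi_k, w_k) \qquad \text{for all } k,
\]
% so that summing over k yields the telescoped bound
\[
  V(\xi_N) \;\le\; V(\xi_0) + \sum_{k=0}^{N-1} S(\xi_k, w_k).
\]
```

Choosing V and S so that the accumulated supply controls f(x_k) - f(x*) converts this telescoping bound into a convergence rate for the iterates.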