Dive into deep learning

A Zhang, ZC Lipton, M Li, AJ Smola - arXiv preprint arXiv:2106.11342, 2021 - arxiv.org
This open-source book represents our attempt to make deep learning approachable,
teaching readers the concepts, the context, and the code. The entire book is drafted in …

Stochastic gradient descent as approximate Bayesian inference

S Mandt, MD Hoffman, DM Blei - Journal of Machine Learning Research, 2017 - jmlr.org
Stochastic Gradient Descent with a constant learning rate (constant SGD) simulates a
Markov chain with a stationary distribution. With this perspective, we derive several new …
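
The claim that constant-step-size SGD behaves like a Markov chain with a stationary distribution is easy to see on a toy problem. The following is a minimal, illustrative Python sketch (my construction, not code from the paper): SGD with Gaussian gradient noise on a one-dimensional quadratic stops converging and instead fluctuates around the minimum, with a spread set by the learning rate.

import numpy as np

# Toy setup (assumed for illustration): f(theta) = 0.5 * a * theta^2, with
# gradient observations corrupted by zero-mean Gaussian noise of std sigma.
a, sigma = 2.0, 1.0
rng = np.random.default_rng(0)

def noisy_grad(theta):
    return a * theta + sigma * rng.standard_normal()

def run_constant_sgd(lr, steps=50_000, burn_in=10_000, theta0=5.0):
    theta, tail = theta0, []
    for t in range(steps):
        theta -= lr * noisy_grad(theta)
        if t >= burn_in:
            tail.append(theta)
    return np.asarray(tail)

# A larger learning rate gives a wider stationary spread around the optimum 0;
# for this linear-Gaussian chain the stationary variance is lr*sigma^2 / (a*(2 - lr*a)).
for lr in (0.01, 0.1):
    samples = run_constant_sgd(lr)
    print(f"lr={lr}: mean={samples.mean():+.3f}, std={samples.std():.3f}")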

A variational perspective on accelerated methods in optimization

A Wibisono, AC Wilson… - Proceedings of the National Academy of Sciences, 2016 - National Acad Sciences
Accelerated gradient methods play a central role in optimization, achieving optimal rates in
many settings. Although many generalizations and extensions of Nesterov's original …

Understanding the acceleration phenomenon via high-resolution differential equations

B Shi, SS Du, MI Jordan, WJ Su - Mathematical Programming, 2022 - Springer
Gradient-based optimization algorithms can be studied from the perspective of limiting
ordinary differential equations (ODEs). Motivated by the fact that existing ODEs do not …
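
For context, a standard result in this ODE literature (not a quotation from the paper): taking the step size s of Nesterov's accelerated gradient method to zero for a smooth convex f yields the low-resolution limiting ODE

\ddot{X}(t) + \frac{3}{t}\dot{X}(t) + \nabla f(X(t)) = 0, \qquad X(0) = x_0, \quad \dot{X}(0) = 0.

The high-resolution analysis referenced above keeps terms of order \sqrt{s}, which introduces a gradient-correction term of the form \sqrt{s}\,\nabla^2 f(X(t))\,\dot{X}(t) and separates Nesterov's method from Polyak's heavy-ball method; see the paper for the exact equations.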

Why momentum really works

G Goh - Distill, 2017 - distill.pub
Interactive article with adjustable step-size (α = 0.02) and momentum (β = 0.99) parameters. We often think of Momentum as a means of dampening …
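
As an illustration of the update being discussed, here is a minimal heavy-ball/momentum sketch in Python (my own toy example using the article's default slider values α = 0.02 and β = 0.99; the quadratic and step count are assumptions chosen to make the ill-conditioned case visible):

import numpy as np

# Minimize f(w) = 0.5 * w^T A w, where A has one very flat direction.
A = np.diag([0.001, 50.0])         # highly ill-conditioned quadratic
grad = lambda w: A @ w

def gd(w, lr=0.02, steps=2000):
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

def momentum(w, lr=0.02, beta=0.99, steps=2000):
    z = np.zeros_like(w)
    for _ in range(steps):
        z = beta * z + grad(w)     # decaying accumulation of past gradients
        w = w - lr * z             # step along the accumulated direction
    return w

w0 = np.array([1.0, 1.0])
print("distance to optimum, GD:      ", np.linalg.norm(gd(w0)))
print("distance to optimum, momentum:", np.linalg.norm(momentum(w0)))

On this example plain gradient descent barely moves along the flat direction, while the momentum iterate accumulates many small gradients and makes real progress, which is the behaviour the article visualizes.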

Noise is not the main factor behind the gap between SGD and Adam on transformers, but sign descent might be

F Kunstner, J Chen, JW Lavington… - arXiv preprint arXiv …, 2023 - arxiv.org
The success of the Adam optimizer on a wide array of architectures has made it the default
in settings where stochastic gradient descent (SGD) performs poorly. However, our …
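
For readers unfamiliar with the term, sign descent updates each coordinate using only the sign of its gradient. A schematic comparison of the three update rules in Python (an illustrative sketch; the hyperparameters and gradients below are made up and are not the paper's experimental setup):

import numpy as np

def sgd_step(w, g, lr=0.1):
    return w - lr * g

def sign_descent_step(w, g, lr=0.1):
    # Ignore gradient magnitudes; move a fixed amount per coordinate.
    return w - lr * np.sign(g)

def adam_step(w, g, state, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    m, v, t = state
    t += 1
    m = b1 * m + (1 - b1) * g                 # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * g**2              # second-moment estimate
    m_hat, v_hat = m / (1 - b1**t), v / (1 - b2**t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), (m, v, t)

w = np.zeros(3)
g = np.array([1e-3, 1.0, -10.0])              # gradient coordinates of very different scales
print("SGD: ", sgd_step(w, g))
print("Sign:", sign_descent_step(w, g))
w_adam, _ = adam_step(w, g, (np.zeros(3), np.zeros(3), 0))
print("Adam:", w_adam)

Note that on the very first step from a zero state, Adam's bias-corrected update reduces (up to eps) to -lr * sign(g), which is one elementary way to see the connection between Adam and sign descent that the title alludes to.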

Nonparametric stochastic approximation with large step-sizes

A Dieuleveut, F Bach - 2016 - projecteuclid.org
We consider the random-design least-squares regression problem within the reproducing
kernel Hilbert space (RKHS) framework. Given a stream of independent and identically …

Harder, better, faster, stronger convergence rates for least-squares regression

A Dieuleveut, N Flammarion, F Bach - Journal of Machine Learning Research, 2017 - jmlr.org
We consider the optimization of a quadratic objective function whose gradients are only
accessible through a stochastic oracle that returns the gradient at any given point plus a …
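
The "stochastic oracle" setting described here is easy to simulate: each query returns the true gradient of a quadratic plus zero-mean noise, and Polyak-Ruppert averaging of the SGD iterates (a standard tool in this line of work) suppresses the noise. A hedged Python sketch with all problem parameters invented for illustration:

import numpy as np

rng = np.random.default_rng(1)

# Quadratic objective f(w) = 0.5 * w^T H w - b^T w with minimizer w_star.
d = 5
H = np.diag(np.linspace(0.1, 1.0, d))
w_star = rng.standard_normal(d)
b = H @ w_star

def oracle(w, noise_std=0.5):
    # True gradient plus zero-mean Gaussian noise, as in the setting described.
    return H @ w - b + noise_std * rng.standard_normal(d)

def sgd_with_averaging(lr=0.5, steps=20_000):
    w = np.zeros(d)
    w_bar = np.zeros(d)
    for t in range(1, steps + 1):
        w = w - lr * oracle(w)
        w_bar += (w - w_bar) / t   # running Polyak-Ruppert average of the iterates
    return w, w_bar

w_last, w_avg = sgd_with_averaging()
print("error of last iterate:    ", np.linalg.norm(w_last - w_star))
print("error of averaged iterate:", np.linalg.norm(w_avg - w_star))

With a constant step size the last iterate keeps fluctuating, while the averaged iterate settles much closer to the minimizer.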

A variational analysis of stochastic gradient algorithms

S Mandt, M Hoffman, D Blei - International Conference on Machine Learning, 2016 - proceedings.mlr.press
Stochastic Gradient Descent (SGD) is an important algorithm in machine learning.
With constant learning rates, it is a stochastic process that, after an initial phase of …
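
A compact way to state the kind of result this line of work establishes (written schematically from memory, so the exact constants and scalings should be checked against the papers): near an optimum \theta^*, constant-step-size SGD with learning rate \varepsilon and gradient-noise covariance BB^\top is approximated by an Ornstein-Uhlenbeck process

d\theta(t) = -\varepsilon A\,(\theta(t) - \theta^*)\,dt + \varepsilon\, B\, dW(t),

where A is the Hessian of the loss at \theta^*. Its stationary covariance \Sigma solves A\Sigma + \Sigma A^\top = \varepsilon\, BB^\top, so the spread of the stationary distribution grows with the learning rate, which is what lets a tuned constant learning rate turn SGD into an approximate posterior sampler.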

Dissipativity theory for Nesterov's accelerated method

B Hu, L Lessard - International Conference on Machine Learning, 2017 - proceedings.mlr.press
In this paper, we adapt the control theoretic concept of dissipativity theory to provide a
natural understanding of Nesterov's accelerated method. Our theory ties rigorous …
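
For reference, one common way to write Nesterov's accelerated method for an L-smooth convex function f (the standard textbook form of the iteration that Lyapunov/dissipativity certificates of this kind are built around; not copied from the paper):

x_{k+1} = y_k - \tfrac{1}{L}\,\nabla f(y_k), \qquad y_{k+1} = x_{k+1} + \beta_k\,(x_{k+1} - x_k),

with momentum weights such as \beta_k = \frac{k-1}{k+2} in the convex case, or the constant \beta = \frac{\sqrt{L} - \sqrt{\mu}}{\sqrt{L} + \sqrt{\mu}} when f is \mu-strongly convex.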