High-dimensional asymptotics of feature learning: How one gradient step improves the representation

J Ba, MA Erdogdu, T Suzuki, Z Wang… - Advances in Neural …, 2022 - proceedings.neurips.cc
We study the first gradient descent step on the first-layer parameters $\boldsymbol{W}$ in a
two-layer neural network: $f(\boldsymbol{x})=\frac{1}{\sqrt{N}}\boldsymbol{a}^\top\sigma$ …
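As a concrete illustration of the setting in this snippet, the sketch below takes a single full-batch gradient step on the first-layer weights of a two-layer network of the form $f(\boldsymbol{x})=\frac{1}{\sqrt{N}}\boldsymbol{a}^\top\sigma(\boldsymbol{W}\boldsymbol{x})$ (the argument of $\sigma$ is assumed, since the snippet is truncated). The squared loss, ReLU activation, Gaussian data, and step size are illustrative assumptions, not the paper's exact setup.

# Minimal sketch (assumed setup): one gradient step on the first-layer weights W of
# a two-layer network f(x) = (1/sqrt(N)) a^T sigma(W x), squared loss, ReLU.
import numpy as np

rng = np.random.default_rng(0)
n, d, N = 512, 64, 256                         # samples, input dim, hidden width (assumed)
X = rng.standard_normal((n, d))                # inputs
y = rng.standard_normal(n)                     # targets (placeholder)

W = rng.standard_normal((N, d)) / np.sqrt(d)   # first-layer weights
a = rng.standard_normal(N)                     # second-layer weights, kept fixed here

sigma = lambda z: np.maximum(z, 0.0)           # ReLU activation
dsigma = lambda z: (z > 0).astype(z.dtype)     # its derivative

# Forward pass: f(x) = (1/sqrt(N)) a^T sigma(W x)
pre = X @ W.T                                  # (n, N) pre-activations
pred = sigma(pre) @ a / np.sqrt(N)             # (n,) predictions

# Gradient of the empirical squared loss with respect to W only
resid = pred - y
grad_W = ((resid[:, None] * dsigma(pre)) * a[None, :] / np.sqrt(N)).T @ X / n

eta = 1.0                                      # step size (assumed)
W = W - eta * grad_W                           # the single gradient step on W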

Deep learning: a statistical viewpoint

PL Bartlett, A Montanari, A Rakhlin - Acta numerica, 2021 - cambridge.org
The remarkable practical success of deep learning has revealed some major surprises from
a theoretical perspective. In particular, simple gradient methods easily find near-optimal …

Implicit bias of gradient descent for wide two-layer neural networks trained with the logistic loss

L Chizat, F Bach - Conference on learning theory, 2020 - proceedings.mlr.press
Neural networks trained to minimize the logistic (aka cross-entropy) loss with gradient-based
methods are observed to perform well in many supervised classification tasks. Towards …

On the global convergence of gradient descent for over-parameterized models using optimal transport

L Chizat, F Bach - Advances in neural information …, 2018 - proceedings.neurips.cc
Many tasks in machine learning and signal processing can be solved by minimizing a
convex function of a measure. This includes sparse spikes deconvolution or training a …

Mean-field theory of two-layers neural networks: dimension-free bounds and kernel limit

S Mei, T Misiakiewicz… - Conference on learning …, 2019 - proceedings.mlr.press
We consider learning two-layer neural networks using stochastic gradient descent. The
mean-field description of this learning dynamics approximates the evolution of the network …

Mean-field Langevin dynamics: Exponential convergence and annealing

L Chizat - arXiv preprint arXiv:2202.01009, 2022 - arxiv.org
Noisy particle gradient descent (NPGD) is an algorithm to minimize convex functions over
the space of measures that include an entropy term. In the many-particle limit, this algorithm …
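As a rough illustration of the algorithm named in this snippet, the sketch below runs noisy particle gradient descent on a placeholder quadratic potential; the potential, step size, temperature, and particle count are assumptions chosen only to keep the example self-contained, whereas in the paper the drift comes from the first variation of a convex functional of the particle measure.

# Minimal sketch of noisy particle gradient descent (NPGD): m particles take a gradient
# step on a placeholder potential plus Gaussian noise, so their empirical measure
# approximates the mean-field Langevin dynamics for an entropy-regularized objective.
import numpy as np

rng = np.random.default_rng(0)
m, d = 1000, 2                     # number of particles, parameter dimension (assumed)
theta = rng.standard_normal((m, d))

lam = 0.1                          # entropy (temperature) parameter (assumed)
eta = 0.01                         # step size (assumed)
steps = 500

def grad_V(theta):
    # Gradient of a placeholder potential V(theta) = |theta|^2 / 2.
    # In the paper's setting this role is played by the first variation of the
    # convex objective, evaluated at the current particle measure.
    return theta

for _ in range(steps):
    noise = rng.standard_normal(theta.shape)
    theta = theta - eta * grad_V(theta) + np.sqrt(2.0 * eta * lam) * noise

# The particles now approximate the Gibbs measure proportional to exp(-V / lam).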

Gradient descent on infinitely wide neural networks: Global convergence and generalization

F Bach, L Chizat - arXiv preprint arXiv:2110.08084, 2021 - arxiv.org
Many supervised machine learning methods are naturally cast as optimization problems. For
prediction models which are linear in their parameters, this often leads to convex problems …

Convex analysis of the mean field Langevin dynamics

A Nitanda, D Wu, T Suzuki - International Conference on …, 2022 - proceedings.mlr.press
As an example of the nonlinear Fokker-Planck equation, the mean field Langevin dynamics
has recently attracted attention due to its connection to (noisy) gradient descent on infinitely wide …

Sparse optimization on measures with over-parameterized gradient descent

L Chizat - Mathematical Programming, 2022 - Springer
Minimizing a convex function of a measure with a sparsity-inducing penalty is a typical
problem arising, e.g., in sparse spikes deconvolution or two-layer neural network training …

Feature learning via mean-field Langevin dynamics: classifying sparse parities and beyond

T Suzuki, D Wu, K Oko… - Advances in Neural …, 2023 - proceedings.neurips.cc
Neural networks in the mean-field regime are known to be capable of feature learning,
unlike the kernel (NTK) counterpart. Recent works have shown that mean-field neural …