Robust training under label noise by over-parameterization

S Liu, Z Zhu, Q Qu, C You - International Conference on …, 2022 - proceedings.mlr.press
Recently, over-parameterized deep networks, with more network parameters than training samples, have dominated the performance of modern machine learning …

Max-margin token selection in attention mechanism

D Ataee Tarzanagh, Y Li, X Zhang… - Advances in Neural …, 2023 - proceedings.neurips.cc
The attention mechanism is a central component of the transformer architecture, which led to the phenomenal success of large language models. However, the theoretical principles …

SGD with large step sizes learns sparse features

M Andriushchenko, AV Varre… - International …, 2023 - proceedings.mlr.press
We showcase important features of the dynamics of Stochastic Gradient Descent (SGD) in the training of neural networks. We present empirical observations that commonly used …

Transformers as support vector machines

DA Tarzanagh, Y Li, C Thrampoulidis… - arXiv preprint arXiv …, 2023 - arxiv.org
Since its inception in "Attention Is All You Need", the transformer architecture has led to revolutionary advancements in NLP. The attention layer within the transformer admits a …
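For orientation on the two attention papers above, a common single-head attention formulation (a standard convention, not quoted from either abstract) takes tokens $X = [x_1, \dots, x_T]^\top$ and a query embedding $q$, and outputs $X^\top \mathrm{softmax}(XWq)$, so the softmax acts as a soft selection over the $T$ tokens. Roughly speaking, these works characterize which tokens that selection concentrates on when the attention weights $W$ are trained by gradient descent, via an SVM-style max-margin problem; the precise model and assumptions are in the papers.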

Label noise SGD provably prefers flat global minimizers

A Damian, T Ma, JD Lee - Advances in Neural Information …, 2021 - proceedings.neurips.cc
In overparametrized models, the noise in stochastic gradient descent (SGD) implicitly
regularizes the optimization trajectory and determines which local minimum SGD converges …
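As a schematic of the flat-minima preference described in this abstract (stated loosely, not quoted from the paper): with step size $\eta$ and label-noise variance $\sigma^2$, label-noise SGD is shown, under the paper's assumptions, to track stationary points of a regularized objective of roughly the form $L(\theta) + \lambda\,\mathrm{tr}(\nabla^2 L(\theta))$ with $\lambda$ proportional to $\eta\sigma^2$, so minima with small Hessian trace (flat minima) are preferred; the exact constants and conditions are given in the paper.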

Saddle-to-saddle dynamics in diagonal linear networks

S Pesme, N Flammarion - Advances in Neural Information …, 2023 - proceedings.neurips.cc
In this paper we fully describe the trajectory of gradient flow over $2$-layer diagonal linear networks for the regression setting in the limit of vanishing initialisation. We show that the …
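Several entries here concern $2$-layer diagonal linear networks; for reference, a standard parameterization (a common convention, not quoted from these abstracts) is the linear predictor $f_{u,v}(x) = \langle u \odot v, x\rangle$, i.e. an effective regression vector $\beta = u \odot v$ with entrywise product $\odot$. The implicit-bias results below describe which interpolating $\beta$ gradient flow, gradient descent, or SGD selects, for instance solutions of low $\ell_1$ norm (sparse solutions) in the small-initialization limit.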

Implicit bias of SGD for diagonal linear networks: a provable benefit of stochasticity

S Pesme, L Pillaud-Vivien… - Advances in Neural …, 2021 - proceedings.neurips.cc
Understanding the implicit bias of training algorithms is of crucial importance in order to
explain the success of overparametrised neural networks. In this paper, we study the …

(S)GD over Diagonal Linear Networks: Implicit Bias, Large Stepsizes and Edge of Stability

M Even, S Pesme, S Gunasekar… - Advances in Neural …, 2023 - proceedings.neurips.cc
In this paper, we investigate the impact of stochasticity and large stepsizes on the implicit regularisation of gradient descent (GD) and stochastic gradient descent (SGD) over $2$ …

What Happens after SGD Reaches Zero Loss? -- A Mathematical Framework

Z Li, T Wang, S Arora - arXiv preprint arXiv:2110.06914, 2021 - arxiv.org
Understanding the implicit bias of Stochastic Gradient Descent (SGD) is one of the key
challenges in deep learning, especially for overparametrized models, where the local …

Implicit bias of the step size in linear diagonal neural networks

MS Nacson, K Ravichandran… - International …, 2022 - proceedings.mlr.press
Focusing on diagonal linear networks as a model for understanding the implicit bias in
underdetermined models, we show how the gradient descent step size can have a large …