Robust training under label noise by over-parameterization
Recently, over-parameterized deep networks, with increasingly more network parameters
than training samples, have dominated the performance of modern machine learning …
Max-margin token selection in attention mechanism
The attention mechanism is a central component of the transformer architecture, which led to
the phenomenal success of large language models. However, the theoretical principles …
SGD with large step sizes learns sparse features
We showcase important features of the dynamics of Stochastic Gradient Descent (SGD)
in the training of neural networks. We present empirical observations that commonly used …
Transformers as support vector machines
Since its inception in "Attention Is All You Need", the transformer architecture has led to
revolutionary advancements in NLP. The attention layer within the transformer admits a …
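Both attention entries above describe the softmax scores of an attention layer as a token-selection mechanism. The toy NumPy sketch below is only an illustration of that intuition, not code from either paper; the token embeddings, the combined key-query matrix W, and the scaling loop are made-up quantities. It shows that scaling up the attention weights sharpens the softmax until the layer effectively picks out the single highest-scoring token.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
d = 4
X = rng.standard_normal((5, d))      # 5 token embeddings (hypothetical data)
q = rng.standard_normal(d)           # a single query vector
W = rng.standard_normal((d, d))      # combined key-query weight matrix

scores = X @ W @ q                   # one attention score per token
for scale in (1.0, 5.0, 25.0):       # growing the weights sharpens the softmax
    probs = softmax(scale * scores)
    print(scale, np.round(probs, 3))
# As the scale grows, the attention output probs @ X concentrates on the token
# with the largest score, i.e. the layer "selects" that token.
```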
Label noise SGD provably prefers flat global minimizers
In overparametrized models, the noise in stochastic gradient descent (SGD) implicitly
regularizes the optimization trajectory and determines which local minimum SGD converges …
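As a rough illustration of the label-noise setting studied in the entry above, the sketch below runs SGD on an over-parameterized least-squares problem and adds fresh Gaussian noise to the sampled label at every step. The problem sizes, step size, and noise level are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 50                                    # over-parameterized: d > n
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d)                   # clean targets (illustrative)

w = np.zeros(d)
lr, sigma = 0.01, 0.1
for step in range(20000):
    i = rng.integers(n)                          # sample one training example
    y_noisy = y[i] + sigma * rng.standard_normal()   # fresh label noise each step
    grad = (X[i] @ w - y_noisy) * X[i]           # squared-loss gradient on that example
    w -= lr * grad
# Once w interpolates the clean data, the residual X[i] @ w - y[i] is tiny, so the
# update is driven almost entirely by the injected noise; this persistent noise is
# the implicit regularization effect analysed in the entry above.
```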
Saddle-to-saddle dynamics in diagonal linear networks
In this paper we fully describe the trajectory of gradient flow over $2$-layer diagonal linear
networks for the regression setting in the limit of vanishing initialisation. We show that the …
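Several entries above study $2$-layer diagonal linear networks. The sketch below uses one common parameterisation, $\beta = u \odot u - v \odot v$, trained with plain gradient descent from a small initialisation; the problem sizes, constants, and initialisation are illustrative, and the exact parameterisation and algorithm differ across the papers.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 40                            # under-determined regression: d > n
X = rng.standard_normal((n, d))
beta_star = np.zeros(d)
beta_star[:3] = [1.0, -2.0, 1.5]         # sparse ground truth (illustrative)
y = X @ beta_star

alpha, lr = 1e-3, 1e-2                   # small initialisation scale, step size
u = alpha * np.ones(d)
v = alpha * np.ones(d)
for _ in range(20000):
    beta = u * u - v * v                 # effective linear predictor of the network
    g = X.T @ (X @ beta - y) / n         # gradient of the squared loss w.r.t. beta
    u, v = u - lr * 2 * g * u, v + lr * 2 * g * v   # chain rule through beta

print(np.round((u * u - v * v)[:6], 2))  # tends to recover the sparse vector
```

With a small initialisation scale the recovered vector tends to be sparse; how this implicit bias depends on the initialisation, the step size, and the stochasticity of the updates is what the diagonal-linear-network entries listed here analyse.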
Implicit bias of sgd for diagonal linear networks: a provable benefit of stochasticity
Understanding the implicit bias of training algorithms is of crucial importance in order to
explain the success of overparametrised neural networks. In this paper, we study the …
(S)GD over Diagonal Linear Networks: Implicit Bias, Large Stepsizes and Edge of Stability
In this paper, we investigate the impact of stochasticity and large stepsizes on the implicit
regularisation of gradient descent (GD) and stochastic gradient descent (SGD) over $2$ …
What Happens after SGD Reaches Zero Loss?--A Mathematical Framework
Understanding the implicit bias of Stochastic Gradient Descent (SGD) is one of the key
challenges in deep learning, especially for overparametrized models, where the local …
Implicit bias of the step size in linear diagonal neural networks
Focusing on diagonal linear networks as a model for understanding the implicit bias in
underdetermined models, we show how the gradient descent step size can have a large …