Robust training under label noise by over-parameterization
Recently, over-parameterized deep networks, with increasingly more network parameters
than training samples, have dominated the performance of modern machine learning …
Max-margin token selection in attention mechanism
The attention mechanism is a central component of the transformer architecture, which led to
the phenomenal success of large language models. However, the theoretical principles …
SGD with large step sizes learns sparse features
We showcase important features of the dynamics of Stochastic Gradient Descent (SGD)
in the training of neural networks. We present empirical observations that commonly used …
Transformers as support vector machines
Since its inception in "Attention Is All You Need", the transformer architecture has led to
revolutionary advancements in NLP. The attention layer within the transformer admits a …
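Both attention entries above describe the softmax scores of an attention layer as a token-selection mechanism. The toy NumPy sketch below is only an illustration of that intuition, not code from either paper; the token embeddings, the combined key-query matrix W, and the scaling loop are made-up quantities. It shows that scaling up the attention weights sharpens the softmax until the layer effectively picks out the single highest-scoring token.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
d = 4
X = rng.standard_normal((5, d))      # 5 token embeddings (hypothetical data)
q = rng.standard_normal(d)           # a single query vector
W = rng.standard_normal((d, d))      # combined key-query weight matrix

scores = X @ W @ q                   # one attention score per token
for scale in (1.0, 5.0, 25.0):       # growing the weights sharpens the softmax
    probs = softmax(scale * scores)
    print(scale, np.round(probs, 3))
# As the scale grows, the attention output probs @ X concentrates on the token
# with the largest score, i.e. the layer "selects" that token.
```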
Label noise SGD provably prefers flat global minimizers
In overparametrized models, the noise in stochastic gradient descent (SGD) implicitly
regularizes the optimization trajectory and determines which local minimum SGD converges …
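As a rough illustration of the label-noise setting studied in the entry above, the sketch below runs SGD on an over-parameterized least-squares problem and adds fresh Gaussian noise to the sampled label at every step. The problem sizes, step size, and noise level are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 50                                    # over-parameterized: d > n
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d)                   # clean targets (illustrative)

w = np.zeros(d)
lr, sigma = 0.01, 0.1
for step in range(20000):
    i = rng.integers(n)                          # sample one training example
    y_noisy = y[i] + sigma * rng.standard_normal()   # fresh label noise each step
    grad = (X[i] @ w - y_noisy) * X[i]           # squared-loss gradient on that example
    w -= lr * grad
# Once w interpolates the clean data, the residual X[i] @ w - y[i] is tiny, so the
# update is driven almost entirely by the injected noise; this persistent noise is
# the implicit regularization effect analysed in the entry above.
```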
Saddle-to-saddle dynamics in diagonal linear networks
In this paper we fully describe the trajectory of gradient flow over $2$-layer diagonal linear
networks for the regression setting in the limit of vanishing initialisation. We show that the …
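Several entries above study $2$-layer diagonal linear networks. The sketch below uses one common parameterisation, $\beta = u \odot u - v \odot v$, trained with plain gradient descent from a small initialisation; the problem sizes, constants, and initialisation are illustrative, and the exact parameterisation and algorithm differ across the papers.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 40                            # under-determined regression: d > n
X = rng.standard_normal((n, d))
beta_star = np.zeros(d)
beta_star[:3] = [1.0, -2.0, 1.5]         # sparse ground truth (illustrative)
y = X @ beta_star

alpha, lr = 1e-3, 1e-2                   # small initialisation scale, step size
u = alpha * np.ones(d)
v = alpha * np.ones(d)
for _ in range(20000):
    beta = u * u - v * v                 # effective linear predictor of the network
    g = X.T @ (X @ beta - y) / n         # gradient of the squared loss w.r.t. beta
    u, v = u - lr * 2 * g * u, v + lr * 2 * g * v   # chain rule through beta

print(np.round((u * u - v * v)[:6], 2))  # tends to recover the sparse vector
```

With a small initialisation scale the recovered vector tends to be sparse; how this implicit bias depends on the initialisation, the step size, and the stochasticity of the updates is what the diagonal-linear-network entries listed here analyse.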
Implicit bias of sgd for diagonal linear networks: a provable benefit of stochasticity
Understanding the implicit bias of training algorithms is of crucial importance in order to
explain the success of overparametrised neural networks. In this paper, we study the …
(S)GD over Diagonal Linear Networks: Implicit Bias, Large Stepsizes and Edge of Stability
In this paper, we investigate the impact of stochasticity and large stepsizes on the implicit
regularisation of gradient descent (GD) and stochastic gradient descent (SGD) over $2$ …
What Happens after SGD Reaches Zero Loss?--A Mathematical Framework
Understanding the implicit bias of Stochastic Gradient Descent (SGD) is one of the key
challenges in deep learning, especially for overparametrized models, where the local …
Implicit bias of the step size in linear diagonal neural networks
Focusing on diagonal linear networks as a model for understanding the implicit bias in
underdetermined models, we show how the gradient descent step size can have a large …