Max-margin token selection in attention mechanism

D Ataee Tarzanagh, Y Li, X Zhang… - Advances in Neural …, 2023 - proceedings.neurips.cc
The attention mechanism is a central component of the transformer architecture, which has led to the
phenomenal success of large language models. However, the theoretical principles …

Transformers as support vector machines

DA Tarzanagh, Y Li, C Thrampoulidis… - arXiv preprint arXiv …, 2023 - arxiv.org
Since its inception in" Attention Is All You Need", transformer architecture has led to
revolutionary advancements in NLP. The attention layer within the transformer admits a …

Same pre-training loss, better downstream: Implicit bias matters for language models

H Liu, SM **e, Z Li, T Ma - International Conference on …, 2023 - proceedings.mlr.press
Language modeling on large-scale datasets improves performance of various
downstream tasks. The validation pre-training loss is often used as the evaluation metric for …

Saddle-to-saddle dynamics in diagonal linear networks

S Pesme, N Flammarion - Advances in Neural Information …, 2023 - proceedings.neurips.cc
In this paper we fully describe the trajectory of gradient flow over $2$-layer diagonal linear
networks for the regression setting in the limit of vanishing initialisation. We show that the …

On the implicit bias of initialization shape: Beyond infinitesimal mirror descent

S Azulay, E Moroshko, MS Nacson… - International …, 2021 - proceedings.mlr.press
Recent work has highlighted the role of initialization scale in determining the structure of the
solutions that gradient methods converge to. In particular, it was shown that large …

A precise high-dimensional asymptotic theory for boosting and minimum-ℓ1-norm interpolated classifiers

T Liang, P Sur - The Annals of Statistics, 2022 - projecteuclid.org
The Annals of Statistics, 2022, Vol. 50, No. 3, 1669–1695 …

Implicit bias of gradient descent on reparametrized models: On equivalence to mirror descent

Z Li, T Wang, JD Lee, S Arora - Advances in Neural …, 2022 - proceedings.neurips.cc
As part of the effort to understand the implicit bias of gradient descent in overparametrized
models, several results have shown how the training trajectory on the overparametrized …

Implicit bias of mirror flow on separable data

S Pesme, RA Dragomir… - Advances in Neural …, 2025 - proceedings.neurips.cc
We examine the continuous-time counterpart of mirror descent, namely mirror flow, on
classification problems which are linearly separable. Such problems are minimised 'at …

Reparameterizing mirror descent as gradient descent

E Amid, MKK Warmuth - Advances in Neural Information …, 2020 - proceedings.neurips.cc
Most of the recent successful applications of neural networks have been based on training
with gradient descent updates. However, for some small networks, other mirror descent …

Convergence rates of gradient methods for convex optimization in the space of measures

L Chizat - Open Journal of Mathematical Optimization, 2022 - numdam.org
We study the convergence rate of Bregman gradient methods for convex optimization in the
space of measures on a d-dimensional manifold. Under basic regularity assumptions, we …