Max-margin token selection in attention mechanism

D Ataee Tarzanagh, Y Li, X Zhang… - Advances in Neural …, 2023 - proceedings.neurips.cc
The attention mechanism is a central component of the transformer architecture, which led to the
phenomenal success of large language models. However, the theoretical principles …

Transformers as support vector machines

DA Tarzanagh, Y Li, C Thrampoulidis… - arXiv preprint arXiv …, 2023 - arxiv.org
Since its inception in "Attention Is All You Need", the transformer architecture has led to
revolutionary advancements in NLP. The attention layer within the transformer admits a …

A fast optimization view: Reformulating single layer attention in LLM based on tensor and SVM trick, and solving it in matrix multiplication time

Y Gao, Z Song, W Wang, J Yin - arXiv preprint arXiv:2309.07418, 2023 - arxiv.org
Large language models (LLMs) have played a pivotal role in revolutionizing various facets
of our daily existence. Solving attention regression is a fundamental task in optimizing LLMs …

Implicit bias of gradient descent on reparametrized models: On equivalence to mirror descent

Z Li, T Wang, JD Lee, S Arora - Advances in Neural …, 2022 - proceedings.neurips.cc
As part of the effort to understand the implicit bias of gradient descent in overparametrized
models, several results have shown how the training trajectory on the overparametrized …

The power and limitation of pretraining-finetuning for linear regression under covariate shift

J Wu, D Zou, V Braverman, Q Gu… - Advances in Neural …, 2022 - proceedings.neurips.cc
We study linear regression under covariate shift, where the marginal distribution over the
input covariates differs in the source and the target domains, while the conditional …

Homogenization of SGD in high-dimensions: Exact dynamics and generalization properties

C Paquette, E Paquette, B Adlam… - Mathematical …, 2024 - Springer
We develop a stochastic differential equation, called homogenized SGD, for analyzing the
dynamics of stochastic gradient descent (SGD) on a high-dimensional random least squares …

Last iterate risk bounds of SGD with decaying stepsize for overparameterized linear regression

J Wu, D Zou, V Braverman, Q Gu… - … on Machine Learning, 2022 - proceedings.mlr.press
Stochastic gradient descent (SGD) has been shown to generalize well in many deep
learning applications. In practice, one often runs SGD with a geometrically decaying …

Implicit regularization or implicit conditioning? Exact risk trajectories of SGD in high dimensions

C Paquette, E Paquette, B Adlam… - Advances in Neural …, 2022 - proceedings.neurips.cc
Stochastic gradient descent (SGD) is a pillar of modern machine learning, serving as the
go-to optimization algorithm for a diverse array of problems. While the empirical success of SGD …

How transformers utilize multi-head attention in in-context learning? A case study on sparse linear regression

X Chen, L Zhao, D Zou - arXiv preprint arXiv:2408.04532, 2024 - arxiv.org
Despite the remarkable success of transformer-based models in various real-world tasks,
their underlying mechanisms remain poorly understood. Recent studies have suggested that …

Finite-sample analysis of learning high-dimensional single ReLU neuron

J Wu, D Zou, Z Chen, V Braverman… - International …, 2023 - proceedings.mlr.press
This paper considers the problem of learning a single ReLU neuron with squared loss (aka
ReLU regression) in the overparameterized regime, where the input dimension can exceed …