Max-margin token selection in attention mechanism
The attention mechanism is a central component of the transformer architecture, which has led to the
phenomenal success of large language models. However, the theoretical principles …
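As a toy illustration of what "token selection" means in this line of work (a sketch only; the dimensions, weights, and scales below are assumptions, not the paper's construction), the following shows how softmax attention concentrates on a single key token as the attention weights grow in norm:

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 8, 5                                    # embedding dimension, sequence length (assumed)
X = rng.standard_normal((T, d))                # key tokens
z = rng.standard_normal(d)                     # query token
W = rng.standard_normal((d, d)) / np.sqrt(d)   # combined query-key weight matrix (assumed)

def attention_probs(scale):
    """Softmax attention weights of one query over the T key tokens."""
    scores = X @ (scale * W) @ z
    p = np.exp(scores - scores.max())
    return p / p.sum()

for scale in (1.0, 5.0, 25.0):
    print(scale, np.round(attention_probs(scale), 3))
# As `scale` (the norm of the attention weights) grows, the probability mass
# collapses onto the single highest-scoring key token -- the "selected" token.
```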
Transformers as support vector machines
Since its inception in "Attention Is All You Need", the transformer architecture has led to
revolutionary advancements in NLP. The attention layer within the transformer admits a …
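A schematic version of the max-margin problem that this line of work associates with trained attention weights (the notation is assumed here: key tokens x_{i,t}, query token z_i, and opt_i the index of the token that should be selected in example i; the precise formulation varies across papers):

\[
\min_{W}\ \|W\|_F \quad \text{s.t.} \quad \big(x_{i,\mathrm{opt}_i} - x_{i,t}\big)^{\top} W z_i \ \ge\ 1 \quad \text{for all } i \ \text{and all } t \neq \mathrm{opt}_i .
\]

Roughly, gradient descent on the attention weights is argued to converge in direction to solutions of a separation problem of this kind, which is where the SVM analogy comes from.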
A fast optimization view: Reformulating single layer attention in LLM based on tensor and SVM trick, and solving it in matrix multiplication time
Large language models (LLMs) have played a pivotal role in revolutionizing various facets
of our daily existence. Solving attention regression is a fundamental task in optimizing LLMs …
Implicit bias of gradient descent on reparametrized models: On equivalence to mirror descent
As part of the effort to understand the implicit bias of gradient descent in overparametrized
models, several results have shown how the training trajectory on the overparametrized …
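A minimal numerical sketch of the kind of equivalence this entry refers to, using the simplest reparametrization w = u ⊙ u for least squares: gradient descent on u approximately traces mirror descent (exponentiated gradient) on w when the stepsize is small. All sizes and stepsizes below are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 20, 50                              # underdetermined least squares (toy sizes, assumed)
A = rng.standard_normal((n, d))
w_star = np.zeros(d); w_star[:3] = 1.0     # sparse ground truth
y = A @ w_star

def grad(w):                               # gradient of (1/2n) ||A w - y||^2
    return A.T @ (A @ w - y) / n

eta = 1e-3
u = np.full(d, 0.1)                        # reparametrized model: w = u * u (entrywise)
w_md = u * u                               # mirror descent iterate, same starting point

for _ in range(5000):
    u = u - eta * 2 * u * grad(u * u)               # gradient descent on u
    w_md = w_md * np.exp(-4 * eta * grad(w_md))     # exponentiated gradient, i.e. mirror descent
                                                    # with potential sum_i (w_i log w_i - w_i)

print(np.max(np.abs(u * u - w_md)))        # the two trajectories stay close for small stepsizes
```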
The power and limitation of pretraining-finetuning for linear regression under covariate shift
We study linear regression under covariate shift, where the marginal distribution over the
input covariates differs between the source and target domains, while the conditional …
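A minimal simulation sketch of the pretraining-finetuning setting under covariate shift (Gaussian covariates with different covariances in source and target, shared regression vector; all sizes, covariances, and stepsizes are assumptions, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 50
beta = rng.standard_normal(d) / np.sqrt(d)       # shared regression vector: E[y|x] = x @ beta

# Covariate shift: the marginal over x differs between domains, the conditional does not.
cov_s = np.linspace(1.0, 2.0, d)                 # source covariate variances (assumed, diagonal)
cov_t = np.linspace(2.0, 0.1, d)                 # target covariate variances (assumed, diagonal)

def sample(n, var):
    X = rng.standard_normal((n, d)) * np.sqrt(var)
    return X, X @ beta + 0.1 * rng.standard_normal(n)

def gd(X, y, w0, lr=0.01, steps=500):            # plain gradient descent on least squares
    w = w0.copy()
    for _ in range(steps):
        w -= lr * X.T @ (X @ w - y) / len(y)
    return w

def target_risk(w):                              # excess risk under the target covariate law
    return (w - beta) @ (cov_t * (w - beta))

Xs, ys = sample(2000, cov_s)                     # plentiful source data for pretraining
Xt, yt = sample(40, cov_t)                       # small target sample for finetuning

w_pre  = gd(Xs, ys, np.zeros(d))                 # pretrain on the source
w_ft   = gd(Xt, yt, w_pre, steps=100)            # finetune from the pretrained solution
w_cold = gd(Xt, yt, np.zeros(d), steps=100)      # fit the target sample from scratch

print(target_risk(w_pre), target_risk(w_ft), target_risk(w_cold))
```

Comparing the three printed target risks gives a quick feel for when pretraining on the plentiful source data helps relative to fitting the small target sample from scratch.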
Homogenization of SGD in high-dimensions: Exact dynamics and generalization properties
We develop a stochastic differential equation, called homogenized SGD, for analyzing the
dynamics of stochastic gradient descent (SGD) on a high-dimensional random least squares …
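The discrete dynamics that such an SDE is meant to track can be simulated directly; the sketch below runs plain single-row SGD on a random least squares problem and records the risk once per pass (all sizes and the stepsize are assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 2000, 400                               # high-dimensional random least squares (toy sizes)
A = rng.standard_normal((n, d)) / np.sqrt(d)
x_star = rng.standard_normal(d)
b = A @ x_star + 0.5 * rng.standard_normal(n)

def risk(x):                                   # full least squares objective
    r = A @ x - b
    return 0.5 * r @ r / n

gamma = 0.5                                    # constant SGD stepsize (assumed)
x = np.zeros(d)
trajectory = [risk(x)]
for t in range(1, 20 * n + 1):                 # single-row (streaming-style) SGD updates
    i = rng.integers(n)
    x -= gamma * (A[i] @ x - b[i]) * A[i]
    if t % n == 0:
        trajectory.append(risk(x))             # record the risk once per pass over the data

print([round(v, 4) for v in trajectory])
```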
Last iterate risk bounds of SGD with decaying stepsize for overparameterized linear regression
Stochastic gradient descent (SGD) has been shown to generalize well in many deep
learning applications. In practice, one often runs SGD with a geometrically decaying …
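A minimal sketch of the geometrically decaying (step-decay) schedule in the overparameterized linear regression setting: the stepsize is held constant within a stage and halved between stages, and only the last iterate is kept. All sizes, stage lengths, and stepsizes below are assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 100, 400                              # overparameterized: dimension exceeds sample size
X = rng.standard_normal((n, d)) / np.sqrt(d)
w_star = rng.standard_normal(d)
y = X @ w_star + 0.1 * rng.standard_normal(n)

w = np.zeros(d)
eta = 1.0                                    # initial stepsize (assumed)
stages, steps_per_stage = 8, 500
for s in range(stages):
    for _ in range(steps_per_stage):         # constant stepsize within each stage
        i = rng.integers(n)
        w -= eta * (X[i] @ w - y[i]) * X[i]
    eta *= 0.5                               # geometric decay: halve the stepsize between stages

print(0.5 * np.mean((X @ w - y) ** 2))       # training loss of the last iterate
```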
Implicit regularization or implicit conditioning? Exact risk trajectories of SGD in high dimensions
Stochastic gradient descent (SGD) is a pillar of modern machine learning, serving as the go-
to optimization algorithm for a diverse array of problems. While the empirical success of SGD …
How transformers utilize multi-head attention in in-context learning? A case study on sparse linear regression
Despite the remarkable success of transformer-based models in various real-world tasks,
their underlying mechanisms remain poorly understood. Recent studies have suggested that …
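A minimal sketch of how in-context prompts for sparse linear regression are commonly constructed in this literature (the packing of each (x, y) pair into one token, and all sizes, are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
d, k, n_ctx = 16, 3, 20                      # dimension, sparsity, context length (assumed)

def sample_task():
    """One in-context task: a k-sparse regression vector and a prompt of (x, y) pairs."""
    w = np.zeros(d)
    support = rng.choice(d, size=k, replace=False)
    w[support] = rng.standard_normal(k)
    X = rng.standard_normal((n_ctx + 1, d))  # the last row is the query point
    y = X @ w
    # Pack each (x_i, y_i) pair into one token of dimension d + 1; the label slot of the
    # query token is zeroed out, and the model is trained to predict the held-out y.
    tokens = np.concatenate([X, y[:, None]], axis=1)
    tokens[-1, -1] = 0.0
    return tokens, y[-1]

tokens, target = sample_task()
print(tokens.shape, target)                  # (n_ctx + 1, d + 1) prompt and its query label
```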
Finite-sample analysis of learning high-dimensional single ReLU neuron
This paper considers the problem of learning a single ReLU neuron with squared loss (aka
ReLU regression) in the overparameterized regime, where the input dimension can exceed …
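A minimal sketch of the ReLU regression setting, with the input dimension larger than the sample size and SGD run on the squared loss of a single ReLU neuron (all sizes, the initialization, and the stepsize are assumptions):

```python
import numpy as np

rng = np.random.default_rng(6)
n, d = 100, 500                                  # input dimension exceeds the sample size
X = rng.standard_normal((n, d)) / np.sqrt(d)
w_star = rng.standard_normal(d)
y = np.maximum(X @ w_star, 0.0) + 0.05 * rng.standard_normal(n)   # single ReLU neuron + noise

w = rng.standard_normal(d) / np.sqrt(d)          # small random init (zero init gets no gradient)
eta = 0.5                                        # stepsize (assumed)
for t in range(20000):                           # SGD on the squared loss of one ReLU neuron
    i = rng.integers(n)
    pred = max(X[i] @ w, 0.0)
    w -= eta * (pred - y[i]) * (X[i] @ w > 0) * X[i]   # subgradient taken as 0 at the kink

print(np.mean((np.maximum(X @ w, 0.0) - y) ** 2))      # training error of ReLU regression
```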