Transformers as support vector machines
Since its inception in" Attention Is All You Need", transformer architecture has led to
revolutionary advancements in NLP. The attention layer within the transformer admits a …
Simplifying transformer blocks
A simple design recipe for deep Transformers is to compose identical building blocks. But
standard transformer blocks are far from simple, interweaving attention and MLP sub-blocks …
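For context, a minimal sketch of the "standard" block being referred to: a pre-LN composition of a self-attention sub-block and an MLP sub-block, each wrapped in a skip connection with layer normalization. Single-head, NumPy, with illustrative shapes and initializations only.

import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(z):
    z = z - z.max(-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

def attention(x, Wq, Wk, Wv):
    # Single-head scaled dot-product self-attention.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return scores @ v

def transformer_block(x, Wq, Wk, Wv, W1, W2):
    x = x + attention(layer_norm(x), Wq, Wk, Wv)        # attention sub-block + skip
    x = x + np.maximum(layer_norm(x) @ W1, 0.0) @ W2    # MLP sub-block + skip
    return x

d = 16
rng = np.random.default_rng(0)
x = rng.standard_normal((10, d))                        # 10 tokens, width d
Wq, Wk, Wv = (rng.standard_normal((d, d)) * d**-0.5 for _ in range(3))
W1 = rng.standard_normal((d, 4 * d)) * d**-0.5
W2 = rng.standard_normal((4 * d, d)) * (4 * d)**-0.5
out = transformer_block(x, Wq, Wk, Wv, W1, W2)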
LoRA+: Efficient low rank adaptation of large models
In this paper, we show that Low Rank Adaptation (LoRA), as originally introduced in Hu et
al. (2021), leads to suboptimal finetuning of models with large width (embedding dimension) …
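As background, a minimal sketch of the original LoRA parameterization of Hu et al. (2021) that the paper builds on: the pretrained weight W0 is frozen and only a low-rank correction B A is trained. The sizes, scaling, and initializations below are illustrative, and the training changes proposed by LoRA+ are not shown.

import numpy as np

d_out, d_in, r, alpha = 64, 64, 4, 8            # hypothetical sizes and scaling factor

rng = np.random.default_rng(0)
W0 = rng.standard_normal((d_out, d_in))         # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01       # trainable, small random init
B = np.zeros((d_out, r))                        # trainable, zero init so W = W0 at start

def lora_forward(x):
    # Effective weight is W0 + (alpha / r) * B @ A; only A and B receive gradients.
    return W0 @ x + (alpha / r) * (B @ (A @ x))

y = lora_forward(rng.standard_normal(d_in))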
Attention with markov: A framework for principled analysis of transformers via markov chains
In recent years, attention-based transformers have achieved tremendous success across a
variety of disciplines, including natural language processing. A key ingredient behind their success is …
Depthwise hyperparameter transfer in residual networks: Dynamics and scaling limit
The cost of hyperparameter tuning in deep learning has been rising with model sizes,
prompting practitioners to find new tuning methods using a proxy of smaller networks. One …
Exploring the frontiers of softmax: Provable optimization, applications in diffusion model, and beyond
The softmax activation function plays a crucial role in the success of large language models
(LLMs), particularly in the self-attention mechanism of the widely adopted Transformer …
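For reference, the standard definition of the softmax and its role in scaled dot-product self-attention; d_k below denotes the key dimension.

\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j} e^{z_j}}, \qquad
\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V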
Measure-to-measure interpolation using Transformers
Transformers are deep neural network architectures that underpin the recent successes of
large language models. Unlike more classical architectures that can be viewed as point-to …
Towards training without depth limits: Batch normalization without gradient explosion
Normalization layers are one of the key building blocks for deep neural networks. Several
theoretical studies have shown that batch normalization improves the signal propagation, by …
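As background, a minimal sketch of a batch-normalization forward pass for a fully connected layer, assuming per-feature statistics taken over the batch dimension; shapes and names are illustrative.

import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # x: (batch, features); normalize each feature across the batch,
    # then apply the learnable scale (gamma) and shift (beta).
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.random.default_rng(0).standard_normal((32, 16))
y = batch_norm(x, gamma=np.ones(16), beta=np.zeros(16))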
Dynamic metastability in the self-attention model
B. Geshkovski, H. Koubbi, Y. Polyanskiy et al., arXiv preprint, 2024.
We consider the self-attention model, an interacting particle system on the unit sphere, which
serves as a toy model for Transformers, the deep neural network architecture behind the …
On feature learning in structured state space models
This paper studies the scaling behavior of state-space models (SSMs) and their structured
variants, such as Mamba, that have recently risen in popularity as alternatives to …
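As background, a minimal sketch of the linear recurrence underlying state-space models, h_t = A h_{t-1} + B u_t, y_t = C h_t. Structured variants such as Mamba constrain A (e.g., to be diagonal) and make parameters input-dependent, which this sketch does not show; all shapes are illustrative.

import numpy as np

def ssm_scan(A, B, C, u):
    # u: (T, d_in); returns outputs y: (T, d_out) from a sequential scan.
    h = np.zeros(A.shape[0])
    ys = []
    for u_t in u:
        h = A @ h + B @ u_t
        ys.append(C @ h)
    return np.stack(ys)

rng = np.random.default_rng(0)
n, d_in, d_out, T = 8, 4, 4, 16
y = ssm_scan(0.9 * np.eye(n),
             rng.standard_normal((n, d_in)),
             rng.standard_normal((d_out, n)),
             rng.standard_normal((T, d_in)))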