Transformers as support vector machines

DA Tarzanagh, Y Li, C Thrampoulidis… - arXiv preprint arXiv …, 2023 - arxiv.org
Since its inception in "Attention Is All You Need", the transformer architecture has led to
revolutionary advancements in NLP. The attention layer within the transformer admits a …

Simplifying transformer blocks

B He, T Hofmann - arXiv preprint arXiv:2311.01906, 2023 - arxiv.org
A simple design recipe for deep Transformers is to compose identical building blocks. But
standard transformer blocks are far from simple, interweaving attention and MLP sub-blocks …
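
To make concrete what "interweaving attention and MLP sub-blocks" refers to, here is a minimal sketch of a standard pre-LN transformer block in NumPy; the single-head attention, ReLU MLP, and dimensions are illustrative assumptions, not the authors' simplified design.

    # Minimal pre-LN transformer block sketch (NumPy only); shapes and the
    # single-head attention are illustrative assumptions.
    import numpy as np

    def layer_norm(x, eps=1e-5):
        mu = x.mean(-1, keepdims=True)
        var = x.var(-1, keepdims=True)
        return (x - mu) / np.sqrt(var + eps)

    def softmax(z):
        z = z - z.max(-1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(-1, keepdims=True)

    def attention(x, Wq, Wk, Wv, Wo):
        q, k, v = x @ Wq, x @ Wk, x @ Wv
        scores = softmax(q @ k.T / np.sqrt(q.shape[-1]))
        return scores @ v @ Wo

    def mlp(x, W1, W2):
        return np.maximum(x @ W1, 0.0) @ W2   # ReLU MLP

    def block(x, params):
        x = x + attention(layer_norm(x), *params["attn"])   # attention sub-block + residual
        x = x + mlp(layer_norm(x), *params["mlp"])          # MLP sub-block + residual
        return x

    d, T = 16, 8
    rng = np.random.default_rng(0)
    params = {
        "attn": [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(4)],
        "mlp":  [rng.standard_normal((d, 4 * d)) / np.sqrt(d),
                 rng.standard_normal((4 * d, d)) / np.sqrt(4 * d)],
    }
    print(block(rng.standard_normal((T, d)), params).shape)  # (8, 16)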

Lora+: Efficient low rank adaptation of large models

S Hayou, N Ghosh, B Yu - arXiv preprint arXiv:2402.12354, 2024 - arxiv.org
In this paper, we show that Low Rank Adaptation (LoRA) as originally introduced in Hu et
al. (2021) leads to suboptimal finetuning of models with large width (embedding dimension) …
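
For reference, a minimal sketch of the LoRA parametrization of Hu et al. (2021) that the paper builds on: the frozen weight W0 is adapted as W0 + (alpha/r) B A, with A random and B zero at initialization. The separate learning rates for A and B at the end only gesture at the LoRA+ recipe; the specific ratio is a placeholder, not the paper's prescribed value.

    # Minimal LoRA sketch (NumPy), standard parametrization; the learning-rate
    # ratio at the end is an illustrative placeholder, not the paper's value.
    import numpy as np

    rng = np.random.default_rng(0)
    d_out, d_in, r, alpha = 64, 64, 4, 8.0

    W0 = rng.standard_normal((d_out, d_in)) / np.sqrt(d_in)   # frozen pretrained weight
    A = rng.standard_normal((r, d_in)) / np.sqrt(d_in)        # trainable, random init
    B = np.zeros((d_out, r))                                  # trainable, zero init

    def forward(x):
        # Adapter contribution is zero at initialization because B = 0.
        return x @ (W0 + (alpha / r) * B @ A).T

    print(forward(rng.standard_normal((2, d_in))).shape)  # (2, 64)

    lr_A = 1e-4
    lr_B = 16 * lr_A   # LoRA+ trains B with a larger step size than A (ratio illustrative)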

Attention with markov: A framework for principled analysis of transformers via markov chains

AV Makkuva, M Bondaschi, A Girish, A Nagle… - arXiv preprint arXiv …, 2024 - arxiv.org
In recent years, attention-based transformers have achieved tremendous success across a
variety of disciplines including natural languages. A key ingredient behind their success is …

Depthwise hyperparameter transfer in residual networks: Dynamics and scaling limit

B Bordelon, L Noci, MB Li, B Hanin… - arXiv preprint arXiv …, 2023 - arxiv.org
The cost of hyperparameter tuning in deep learning has been rising with model sizes,
prompting practitioners to find new tuning methods using a proxy of smaller networks. One …

Exploring the frontiers of softmax: Provable optimization, applications in diffusion model, and beyond

J Gu, C Li, Y Liang, Z Shi, Z Song - arXiv preprint arXiv:2405.03251, 2024 - arxiv.org
The softmax activation function plays a crucial role in the success of large language models
(LLMs), particularly in the self-attention mechanism of the widely adopted Transformer …

Measure-to-measure interpolation using Transformers

B Geshkovski, P Rigollet, D Ruiz-Balet - arXiv preprint arXiv:2411.04551, 2024 - arxiv.org
Transformers are deep neural network architectures that underpin the recent successes of
large language models. Unlike more classical architectures that can be viewed as point-to …

Towards training without depth limits: Batch normalization without gradient explosion

A Meterez, A Joudaki, F Orabona, A Immer… - arXiv preprint arXiv …, 2023 - arxiv.org
Normalization layers are one of the key building blocks for deep neural networks. Several
theoretical studies have shown that batch normalization improves the signal propagation, by …
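
As a reminder of the operation these analyses concern, here is a minimal batch-normalization sketch (per-feature standardization over the batch); learnable scale/shift and running statistics are omitted, and nothing here reproduces the paper's construction.

    # Minimal batch-normalization sketch (NumPy): standardize each feature
    # across the batch dimension.
    import numpy as np

    def batch_norm(X, eps=1e-5):
        # X has shape (batch, features); normalize every feature over the batch.
        mu = X.mean(axis=0, keepdims=True)
        var = X.var(axis=0, keepdims=True)
        return (X - mu) / np.sqrt(var + eps)

    X = np.random.default_rng(0).standard_normal((32, 4))
    Xn = batch_norm(X)
    print(Xn.mean(axis=0).round(6), Xn.std(axis=0).round(3))  # ~0 mean, ~1 std per feature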

Dynamic metastability in the self-attention model

B Geshkovski, H Koubbi, Y Polyanskiy… - arXiv preprint arXiv …, 2024 - arxiv.org
We consider the self-attention model, an interacting particle system on the unit sphere, which
serves as a toy model for Transformers, the deep neural network architecture behind the …
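
A minimal sketch of an interacting-particle system of this kind: tokens on the unit sphere are each pulled toward a softmax-weighted average of the others and re-projected onto the sphere. The normalization, step size, and temperature below are assumptions for illustration, not the paper's exact model.

    # Sketch of self-attention particle dynamics on the unit sphere (NumPy);
    # parameter choices and the explicit Euler discretization are illustrative.
    import numpy as np

    def step(X, beta=4.0, dt=0.05):
        # X: (n, d) array of unit vectors; one Euler step of the dynamics.
        W = np.exp(beta * X @ X.T)                 # attention weights e^{beta <x_i, x_j>}
        W = W / W.sum(axis=1, keepdims=True)       # row-normalize (softmax over tokens)
        V = W @ X                                  # softmax-weighted averages
        V = V - np.sum(V * X, axis=1, keepdims=True) * X      # project onto tangent space at x_i
        X = X + dt * V
        return X / np.linalg.norm(X, axis=1, keepdims=True)   # retract back to the sphere

    rng = np.random.default_rng(0)
    X = rng.standard_normal((16, 3))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    for _ in range(400):
        X = step(X)
    # Over time the particles typically collapse into one or a few clusters.
    print(np.round(X @ X.T, 2))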

On feature learning in structured state space models

LC Vankadara, J Xu, M Haas… - The Thirty-eighth Annual …, 2024 - openreview.net
This paper studies the scaling behavior of state-space models (SSMs) and their structured
variants, such as Mamba, which have recently risen in popularity as alternatives to …
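
For context, a minimal sketch of the discrete-time linear recurrence underlying SSM layers, h_t = A h_{t-1} + B u_t, y_t = C h_t + D u_t; the selective/structured parametrizations of Mamba-style models are not reproduced, and the matrices below are random placeholders.

    # Minimal single-channel linear SSM sketch (NumPy); matrices are placeholders.
    import numpy as np

    def ssm_scan(u, A, B, C, D):
        # u: (T,) input sequence; unroll the recurrence and return (T,) outputs.
        h = np.zeros(A.shape[0])
        ys = []
        for u_t in u:
            h = A @ h + B * u_t
            ys.append(C @ h + D * u_t)
        return np.array(ys)

    rng = np.random.default_rng(0)
    n = 8                                                                   # state dimension
    A = 0.9 * np.eye(n) + 0.05 * rng.standard_normal((n, n)) / np.sqrt(n)   # stable-ish state matrix
    B, C, D = rng.standard_normal(n), rng.standard_normal(n), 0.0
    print(ssm_scan(rng.standard_normal(32), A, B, C, D).shape)              # (32,)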