Transformers as support vector machines

DA Tarzanagh, Y Li, C Thrampoulidis… - arXiv preprint arXiv …, 2023 - arxiv.org
Since its inception in "Attention Is All You Need", the transformer architecture has led to
revolutionary advancements in NLP. The attention layer within the transformer admits a …

Simplifying transformer blocks

B He, T Hofmann - arXiv preprint arXiv:2311.01906, 2023 - arxiv.org
A simple design recipe for deep Transformers is to compose identical building blocks. But
standard transformer blocks are far from simple, interweaving attention and MLP sub-blocks …
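
To make concrete what "interweaving attention and MLP sub-blocks" refers to, here is a minimal sketch of a standard pre-LN transformer block in NumPy; the single-head attention, ReLU MLP, and dimensions are illustrative assumptions, not the authors' simplified design.

    # Minimal pre-LN transformer block sketch (NumPy only); shapes and the
    # single-head attention are illustrative assumptions.
    import numpy as np

    def layer_norm(x, eps=1e-5):
        mu = x.mean(-1, keepdims=True)
        var = x.var(-1, keepdims=True)
        return (x - mu) / np.sqrt(var + eps)

    def softmax(z):
        z = z - z.max(-1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(-1, keepdims=True)

    def attention(x, Wq, Wk, Wv, Wo):
        q, k, v = x @ Wq, x @ Wk, x @ Wv
        scores = softmax(q @ k.T / np.sqrt(q.shape[-1]))
        return scores @ v @ Wo

    def mlp(x, W1, W2):
        return np.maximum(x @ W1, 0.0) @ W2   # ReLU MLP

    def block(x, params):
        x = x + attention(layer_norm(x), *params["attn"])   # attention sub-block + residual
        x = x + mlp(layer_norm(x), *params["mlp"])          # MLP sub-block + residual
        return x

    d, T = 16, 8
    rng = np.random.default_rng(0)
    params = {
        "attn": [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(4)],
        "mlp":  [rng.standard_normal((d, 4 * d)) / np.sqrt(d),
                 rng.standard_normal((4 * d, d)) / np.sqrt(4 * d)],
    }
    print(block(rng.standard_normal((T, d)), params).shape)  # (8, 16)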

Lora+: Efficient low rank adaptation of large models

S Hayou, N Ghosh, B Yu - arXiv preprint arXiv:2402.12354, 2024 - arxiv.org
In this paper, we show that Low Rank Adaptation (LoRA) as originally introduced in Hu et
al. (2021) leads to suboptimal finetuning of models with large width (embedding dimension) …
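
For reference, a minimal sketch of the LoRA parametrization of Hu et al. (2021) that the paper builds on: the frozen weight W0 is adapted as W0 + (alpha/r) B A, with A random and B zero at initialization. The separate learning rates for A and B at the end only gesture at the LoRA+ recipe; the specific ratio is a placeholder, not the paper's prescribed value.

    # Minimal LoRA sketch (NumPy), standard parametrization; the learning-rate
    # ratio at the end is an illustrative placeholder, not the paper's value.
    import numpy as np

    rng = np.random.default_rng(0)
    d_out, d_in, r, alpha = 64, 64, 4, 8.0

    W0 = rng.standard_normal((d_out, d_in)) / np.sqrt(d_in)   # frozen pretrained weight
    A = rng.standard_normal((r, d_in)) / np.sqrt(d_in)        # trainable, random init
    B = np.zeros((d_out, r))                                  # trainable, zero init

    def forward(x):
        # Adapter contribution is zero at initialization because B = 0.
        return x @ (W0 + (alpha / r) * B @ A).T

    print(forward(rng.standard_normal((2, d_in))).shape)  # (2, 64)

    lr_A = 1e-4
    lr_B = 16 * lr_A   # LoRA+ trains B with a larger step size than A (ratio illustrative)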

Attention with markov: A framework for principled analysis of transformers via markov chains

AV Makkuva, M Bondaschi, A Girish, A Nagle… - arXiv preprint arXiv …, 2024 - arxiv.org
In recent years, attention-based transformers have achieved tremendous success across a
variety of disciplines including natural languages. A key ingredient behind their success is …

Depthwise hyperparameter transfer in residual networks: Dynamics and scaling limit

B Bordelon, L Noci, MB Li, B Hanin… - arXiv preprint arXiv …, 2023 - arxiv.org
The cost of hyperparameter tuning in deep learning has been rising with model sizes,
prompting practitioners to find new tuning methods using a proxy of smaller networks. One …

Exploring the frontiers of softmax: Provable optimization, applications in diffusion model, and beyond

J Gu, C Li, Y Liang, Z Shi, Z Song - arXiv preprint arXiv:2405.03251, 2024 - arxiv.org
The softmax activation function plays a crucial role in the success of large language models
(LLMs), particularly in the self-attention mechanism of the widely adopted Transformer …

Measure-to-measure interpolation using Transformers

B Geshkovski, P Rigollet, D Ruiz-Balet - arXiv preprint arXiv:2411.04551, 2024 - arxiv.org
Transformers are deep neural network architectures that underpin the recent successes of
large language models. Unlike more classical architectures that can be viewed as point-to …

Towards training without depth limits: Batch normalization without gradient explosion

A Meterez, A Joudaki, F Orabona, A Immer… - arXiv preprint arXiv …, 2023 - arxiv.org
Normalization layers are one of the key building blocks for deep neural networks. Several
theoretical studies have shown that batch normalization improves the signal propagation, by …
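
As a reminder of the operation these analyses concern, here is a minimal batch-normalization sketch (per-feature standardization over the batch); learnable scale/shift and running statistics are omitted, and nothing here reproduces the paper's construction.

    # Minimal batch-normalization sketch (NumPy): standardize each feature
    # across the batch dimension.
    import numpy as np

    def batch_norm(X, eps=1e-5):
        # X has shape (batch, features); normalize every feature over the batch.
        mu = X.mean(axis=0, keepdims=True)
        var = X.var(axis=0, keepdims=True)
        return (X - mu) / np.sqrt(var + eps)

    X = np.random.default_rng(0).standard_normal((32, 4))
    Xn = batch_norm(X)
    print(Xn.mean(axis=0).round(6), Xn.std(axis=0).round(3))  # ~0 mean, ~1 std per feature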

Dynamic metastability in the self-attention model

B Geshkovski, H Koubbi, Y Polyanskiy… - arXiv preprint arXiv …, 2024 - arxiv.org
We consider the self-attention model, an interacting particle system on the unit sphere, which
serves as a toy model for Transformers, the deep neural network architecture behind the …
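
A minimal sketch of an interacting-particle system of this kind: tokens on the unit sphere are each pulled toward a softmax-weighted average of the others and re-projected onto the sphere. The normalization, step size, and temperature below are assumptions for illustration, not the paper's exact model.

    # Sketch of self-attention particle dynamics on the unit sphere (NumPy);
    # parameter choices and the explicit Euler discretization are illustrative.
    import numpy as np

    def step(X, beta=4.0, dt=0.05):
        # X: (n, d) array of unit vectors; one Euler step of the dynamics.
        W = np.exp(beta * X @ X.T)                 # attention weights e^{beta <x_i, x_j>}
        W = W / W.sum(axis=1, keepdims=True)       # row-normalize (softmax over tokens)
        V = W @ X                                  # softmax-weighted averages
        V = V - np.sum(V * X, axis=1, keepdims=True) * X      # project onto tangent space at x_i
        X = X + dt * V
        return X / np.linalg.norm(X, axis=1, keepdims=True)   # retract back to the sphere

    rng = np.random.default_rng(0)
    X = rng.standard_normal((16, 3))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    for _ in range(400):
        X = step(X)
    # Over time the particles typically collapse into one or a few clusters.
    print(np.round(X @ X.T, 2))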

On feature learning in structured state space models

LC Vankadara, J Xu, M Haas… - The Thirty-eighth Annual …, 2024 - openreview.net
This paper studies the scaling behavior of state-space models (SSMs) and their structured
variants, such as Mamba, which have recently risen in popularity as alternatives to …
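
For context, a minimal sketch of the discrete-time linear recurrence underlying SSM layers, h_t = A h_{t-1} + B u_t, y_t = C h_t + D u_t; the selective/structured parametrizations of Mamba-style models are not reproduced, and the matrices below are random placeholders.

    # Minimal single-channel linear SSM sketch (NumPy); matrices are placeholders.
    import numpy as np

    def ssm_scan(u, A, B, C, D):
        # u: (T,) input sequence; unroll the recurrence and return (T,) outputs.
        h = np.zeros(A.shape[0])
        ys = []
        for u_t in u:
            h = A @ h + B * u_t
            ys.append(C @ h + D * u_t)
        return np.array(ys)

    rng = np.random.default_rng(0)
    n = 8                                                                   # state dimension
    A = 0.9 * np.eye(n) + 0.05 * rng.standard_normal((n, n)) / np.sqrt(n)   # stable-ish state matrix
    B, C, D = rng.standard_normal(n), rng.standard_normal(n), 0.0
    print(ssm_scan(rng.standard_normal(32), A, B, C, D).shape)              # (32,)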