Dissecting the interplay of attention paths in a statistical mechanics theory of transformers

L Tiberi, F Mignacco, K Irie… - Advances in Neural …, 2025 - proceedings.neurips.cc
Despite the remarkable empirical performance of Transformers, their theoretical
understanding remains elusive. Here, we consider a deep multi-head self-attention network …

Transformers are minimax optimal nonparametric in-context learners

J Kim, T Nakamaki, T Suzuki - Advances in Neural …, 2025 - proceedings.neurips.cc
In-context learning (ICL) of large language models has proven to be a surprisingly effective
method of learning a new task from only a few demonstrative examples. In this paper, we …

In-context learning with representations: Contextual generalization of trained transformers

T Yang, Y Huang, Y Liang… - Advances in Neural …, 2025 - proceedings.neurips.cc
In-context learning (ICL) refers to a remarkable capability of pretrained large language
models, which can learn a new task given a few examples during inference. However …

Unveil benign overfitting for transformer in vision: Training dynamics, convergence, and generalization

J Jiang, W Huang, M Zhang… - Advances in Neural …, 2025 - proceedings.neurips.cc
Transformers have demonstrated great power in the recent development of large
foundational models. In particular, the Vision Transformer (ViT) has brought revolutionary …

Bypassing the exponential dependency: Looped transformers efficiently learn in-context by multi-step gradient descent

B Chen, X Li, Y Liang, Z Shi, Z Song - arXiv preprint arXiv:2410.11268, 2024 - arxiv.org
In-context learning has been recognized as a key factor in the success of Large Language
Models (LLMs). It refers to the model's ability to learn patterns on the fly from provided in …

Transformers learn nonlinear features in context: Nonconvex mean-field dynamics on the attention landscape

J Kim, T Suzuki - arXiv preprint arXiv:2402.01258, 2024 - arxiv.org
Large language models based on the Transformer architecture have demonstrated
impressive capabilities to learn in context. However, existing theoretical studies on how this …

Pretrained transformer efficiently learns low-dimensional target functions in-context

K Oko, Y Song, T Suzuki, D Wu - Advances in Neural …, 2025 - proceedings.neurips.cc
Transformers can efficiently learn in-context from example demonstrations. Most existing
theoretical analyses studied the in-context learning (ICL) ability of transformers for linear …

How does promoting the minority fraction affect generalization? A theoretical study of one-hidden-layer neural network on group imbalance

H Li, S Zhang, Y Zhang, M Wang, S Liu… - IEEE Journal of …, 2024 - ieeexplore.ieee.org
Group imbalance has been a known problem in empirical risk minimization (ERM), where
the achieved high average accuracy is accompanied by low accuracy in a minority group …

On mesa-optimization in autoregressively trained transformers: Emergence and capability

C Zheng, W Huang, R Wang, G Wu… - Advances in Neural …, 2025 - proceedings.neurips.cc
Autoregressively trained transformers have brought a profound revolution to the world,
especially with their in-context learning (ICL) ability to address downstream tasks. Recently …