Transformers as statisticians: Provable in-context learning with in-context algorithm selection

Y Bai, F Chen, H Wang, C Xiong… - Advances in neural …, 2023 - proceedings.neurips.cc
Neural sequence models based on the transformer architecture have demonstrated
remarkable in-context learning (ICL) abilities, where they can perform new tasks …

Training dynamics of multi-head softmax attention for in-context learning: Emergence, convergence, and optimality

S Chen, H Sheen, T Wang, Z Yang - arXiv preprint arXiv:2402.19442, 2024 - arxiv.org
We study the dynamics of gradient flow for training a multi-head softmax attention model for
in-context learning of multi-task linear regression. We establish the global convergence of …
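
To make the studied setup concrete, the sketch below builds a prompt of (x_i, y_i) pairs for a single linear-regression task and predicts the label of a query point with one softmax-attention readout over the examples. It is a toy, kernel-smoothing-style stand-in for illustration only, not the multi-head model or the gradient-flow training dynamics analyzed in the paper.

    # Toy illustration of in-context linear regression with a softmax-attention
    # readout (assumed setup for illustration; not the paper's model or training).
    import numpy as np

    rng = np.random.default_rng(0)
    d, n = 5, 40                      # feature dimension, number of in-context examples
    w_star = rng.normal(size=d)       # hidden task vector drawn for this prompt
    X = rng.normal(size=(n, d))       # in-context inputs x_1, ..., x_n
    y = X @ w_star                    # their (noiseless) labels y_i = <x_i, w_star>
    x_query = rng.normal(size=d)      # query whose label must be inferred in context

    def softmax_attention_readout(X, y, x_query, beta=2.0):
        """Attend from the query to the examples and average their labels."""
        scores = beta * (X @ x_query)             # dot-product attention scores
        weights = np.exp(scores - scores.max())   # numerically stable softmax
        weights /= weights.sum()
        return weights @ y                        # attention-weighted label estimate

    print("in-context prediction:", softmax_attention_readout(X, y, x_query))
    print("ground truth         :", x_query @ w_star)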

Reason for future, act for now: A principled framework for autonomous LLM agents with provable sample efficiency

Z Liu, H Hu, S Zhang, H Guo, S Ke, B Liu… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language models (LLMs) demonstrate impressive reasoning abilities, but translating
reasoning into actions in the real world remains challenging. In particular, it remains unclear …

Approximation and estimation ability of transformers for sequence-to-sequence functions with infinite dimensional input

S Takakura, T Suzuki - International Conference on Machine …, 2023 - proceedings.mlr.press
Despite the great success of Transformer networks in various applications such as natural
language processing and computer vision, their theoretical aspects are not well understood …

A mechanism for sample-efficient in-context learning for sparse retrieval tasks

J Abernethy, A Agarwal, TV Marinov… - International …, 2024 - proceedings.mlr.press
We study the phenomenon of in-context learning (ICL) exhibited by large language models,
where they can adapt to a new learning task, given a handful of labeled examples, without …

Unveiling induction heads: Provable training dynamics and feature learning in transformers

S Chen, H Sheen, T Wang, Z Yang - arXiv preprint arXiv:2409.10559, 2024 - arxiv.org
In-context learning (ICL) is a cornerstone of large language model (LLM) functionality, yet its
theoretical foundations remain elusive due to the complexity of transformer architectures. In …
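
For context, an induction head implements the copying pattern [A][B] ... [A] -> [B]: it locates the most recent earlier occurrence of the current token and predicts the token that followed it. The snippet below hard-codes that rule as a plain function; it is only an illustration of the mechanism, not the trained transformer analyzed in the paper.

    # Hard-coded induction-head rule (illustration of the mechanism only).
    def induction_head_predict(tokens):
        """Predict each next token by prefix matching: find the most recent
        earlier occurrence of the current token and copy its successor."""
        predictions = []
        for t, current in enumerate(tokens):
            guess = None
            for s in range(t - 1, -1, -1):    # scan backwards over earlier positions
                if tokens[s] == current:
                    guess = tokens[s + 1]     # copy the token that followed the match
                    break
            predictions.append(guess)
        return predictions

    print(induction_head_predict(list("abcab")))
    # -> [None, None, None, 'b', 'c']: the repeated 'a' predicts 'b', the repeated 'b' predicts 'c'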

Understanding scaling laws with statistical and approximation theory for transformer neural networks on intrinsically low-dimensional data

A Havrilla, W Liao - arXiv preprint arXiv:2411.06646, 2024 - arxiv.org
When training deep neural networks, a model's generalization error is often observed to
follow a power scaling law dependent both on the model size and the data size. Perhaps the …
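
For orientation, a commonly used empirical form of such a power law (a generic ansatz in model size N and data size D, not the specific bound derived in this paper) is:

    % Generic scaling-law ansatz: error decays as a power of model size N and
    % data size D down to an irreducible floor; alpha, beta are fit empirically.
    \mathcal{L}(N, D) \approx A\, N^{-\alpha} + B\, D^{-\beta} + \mathcal{L}_{\infty}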

Sequence length independent norm-based generalization bounds for transformers

J Trauger, A Tewari - International Conference on Artificial …, 2024 - proceedings.mlr.press
This paper provides norm-based generalization bounds for the Transformer architecture that
do not depend on the input sequence length. We employ a covering number based …
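
As background for the covering-number route mentioned here, the generic uniform-convergence bound has the following shape (stated schematically, up to constants, for a bounded loss; the point of the paper is that the covering number of the Transformer class can be controlled without any dependence on the input sequence length):

    % Generic covering-number bound (schematic, not the paper's statement):
    % with probability at least 1 - delta over n i.i.d. samples,
    \sup_{f \in \mathcal{F}} \big| \widehat{R}_n(f) - R(f) \big|
      \;\lesssim\; \inf_{\epsilon > 0} \Big( \epsilon
        + \sqrt{\tfrac{\log \mathcal{N}(\mathcal{F}, \epsilon)}{n}} \Big)
      \;+\; \sqrt{\tfrac{\log(1/\delta)}{n}}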

Reason for future, act for now: A principled architecture for autonomous LLM agents

Z Liu, H Hu, S Zhang, H Guo, S Ke, B Liu… - Forty-first International …, 2024 - openreview.net
Large language models (LLMs) demonstrate impressive reasoning abilities, but translating
reasoning into actions in the real world remains challenging. In particular, it is unclear how …

Provable Convergence of Single-Timescale Neural Actor-Critic in Continuous Spaces

X Chen, F Zhang, G Wang, L Zhao - openreview.net
Actor-critic (AC) algorithms have been the powerhouse behind many successful yet
challenging applications. However, the theoretical understanding of finite-time convergence …
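
To illustrate what "single-timescale" means here, the sketch below runs a generic actor-critic loop on a toy one-dimensional continuous-state, continuous-action problem, updating actor and critic in the same iteration with step sizes of the same order. The linear critic and Gaussian policy are assumptions for illustration; this is not the neural parameterization or the analysis from the paper.

    # Generic single-timescale actor-critic sketch (toy illustration only).
    import numpy as np

    rng = np.random.default_rng(0)

    def env_step(s, a):
        """Toy 1-D continuous dynamics with quadratic cost."""
        reward = -(s**2 + 0.1 * a**2)
        s_next = 0.8 * s + 0.3 * a + 0.1 * rng.normal()
        return reward, s_next

    theta = 0.0                 # actor: a ~ N(theta * s, sigma^2), a linear-feedback policy
    w = np.zeros(2)             # critic: value estimate V(s) = w[0] + w[1] * s**2
    sigma, gamma = 0.5, 0.95
    alpha_actor, alpha_critic = 1e-3, 1e-3   # same order: the "single timescale"

    s = rng.normal()
    for t in range(20000):
        a = theta * s + sigma * rng.normal()
        r, s_next = env_step(s, a)

        phi, phi_next = np.array([1.0, s**2]), np.array([1.0, s_next**2])
        td_error = r + gamma * (w @ phi_next) - (w @ phi)

        w += alpha_critic * td_error * phi              # critic: TD(0) update
        score = (a - theta * s) * s / sigma**2          # d/dtheta log pi(a | s)
        theta += alpha_actor * td_error * score         # actor: policy-gradient update
        s = s_next

    print("learned feedback gain theta:", theta)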