Bypassing the exponential dependency: Looped transformers efficiently learn in-context by multi-step gradient descent

B Chen, X Li, Y Liang, Z Shi, Z Song - arXiv preprint arXiv:2410.11268, 2024 - arxiv.org
In-context learning has been recognized as a key factor in the success of Large Language
Models (LLMs). It refers to the model's ability to learn patterns on the fly from provided in …
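
For reference, the multi-step gradient descent procedure that the title refers to can be written out directly for an in-context linear regression prompt. The Python sketch below (function name, step count, and learning rate are illustrative choices, not taken from the paper) runs the classical algorithm on the context pairs; it shows the target computation, not the looped-transformer construction itself.

import numpy as np

def multistep_gd_predict(X, y, x_query, steps=10, lr=0.1):
    """Run `steps` of gradient descent on the least-squares loss over the
    in-context examples (X, y), then predict the label of x_query.
    Illustrative baseline only, not the transformer construction."""
    n, d = X.shape
    w = np.zeros(d)                      # start from the zero predictor
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / n     # gradient of (1/2n) * ||Xw - y||^2
        w -= lr * grad                   # one in-context "GD step"
    return x_query @ w

# toy usage: 8 context pairs drawn from a random linear task
rng = np.random.default_rng(0)
w_star = rng.normal(size=4)
X = rng.normal(size=(8, 4))
y = X @ w_star
x_query = rng.normal(size=4)
print(multistep_gd_predict(X, y, x_query), x_query @ w_star)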

How transformers learn causal structure with gradient descent

E Nichani, A Damian, JD Lee - arXiv preprint arXiv:2402.14735, 2024 - arxiv.org
The incredible success of transformers on sequence modeling tasks can be largely
attributed to the self-attention mechanism, which allows information to be transferred …

GaLore: Memory-efficient LLM training by gradient low-rank projection

J Zhao, Z Zhang, B Chen, Z Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Training Large Language Models (LLMs) presents significant memory challenges,
predominantly due to the growing size of weights and optimizer states. Common memory …
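
The title's "gradient low-rank projection" can be illustrated with a minimal Python sketch, assuming the general recipe of projecting each gradient onto a low-rank subspace, stepping in that subspace, and mapping the update back; the SVD-based projector, the chosen rank, and the plain SGD step below are assumptions of this sketch rather than details drawn from the snippet.

import numpy as np

def low_rank_projector(grad, rank):
    """Top-`rank` left singular vectors of the gradient span the subspace."""
    U, _, _ = np.linalg.svd(grad, full_matrices=False)
    return U[:, :rank]                      # (m, r) projector

def projected_sgd_step(weight, grad, P, lr=1e-2):
    """Compress the gradient, step in the r-dimensional subspace, then map
    the update back to the full weight shape. Optimizer state (omitted here)
    would only need to be stored at the compressed size."""
    low_rank_grad = P.T @ grad              # (r, n): memory-cheap
    update = P @ (lr * low_rank_grad)       # back to (m, n)
    return weight - update

# toy usage on a single weight matrix
rng = np.random.default_rng(1)
W = rng.normal(size=(64, 32))
G = rng.normal(size=(64, 32))
P = low_rank_projector(G, rank=4)           # in practice, refresh P periodically
W = projected_sgd_step(W, G, P)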

Training dynamics of multi-head softmax attention for in-context learning: Emergence, convergence, and optimality

S Chen, H Sheen, T Wang, Z Yang - arXiv preprint arXiv:2402.19442, 2024 - arxiv.org
We study the dynamics of gradient flow for training a multi-head softmax attention model for
in-context learning of multi-task linear regression. We establish the global convergence of …
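
One common formalization of this setting (written from the standard ICL linear-regression setup, not necessarily the paper's exact parameterization) predicts the query label as an attention-weighted combination of the context labels:

\[
\hat{y}_{\mathrm{query}}
= \sum_{h=1}^{H} \sum_{i=1}^{N}
  \frac{\exp\!\big(x_{\mathrm{query}}^{\top} W_h\, x_i\big)}
       {\sum_{j=1}^{N} \exp\!\big(x_{\mathrm{query}}^{\top} W_h\, x_j\big)}
  \, v_h\, y_i,
\]

where $W_h$ merges the query and key matrices of head $h$, $v_h$ is a scalar value weight applied to the labels, and $(x_1, y_1), \dots, (x_N, y_N)$ are the in-context examples preceding the query $x_{\mathrm{query}}$.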

A primer on the inner workings of transformer-based language models

J Ferrando, G Sarti, A Bisazza… - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid progress of research aimed at interpreting the inner workings of advanced
language models has highlighted a need for contextualizing the insights gained from years …

An information-theoretic analysis of in-context learning

HJ Jeon, JD Lee, Q Lei, B Van Roy - arXiv preprint arXiv:2401.15530, 2024 - arxiv.org
Previous theoretical results pertaining to meta-learning on sequences build on contrived
assumptions and are somewhat convoluted. We introduce new information-theoretic tools …

Training nonlinear transformers for efficient in-context learning: A theoretical learning and generalization analysis

H Li, M Wang, S Lu, X Cui, PY Chen - arXiv preprint arXiv …, 2024 - researchgate.net
Transformer-based large language models have displayed impressive in-context learning
capabilities, where a pre-trained model can handle new tasks without fine-tuning by simply …

Unveiling induction heads: Provable training dynamics and feature learning in transformers

S Chen, H Sheen, T Wang, Z Yang - arXiv preprint arXiv:2409.10559, 2024 - arxiv.org
In-context learning (ICL) is a cornerstone of large language model (LLM) functionality, yet its
theoretical foundations remain elusive due to the complexity of transformer architectures. In …
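
The induction-head behavior studied here is usually summarized as a copying rule: on a prompt ... [A][B] ... [A], the head attends back to the earlier occurrence of the current token and copies its successor. A minimal Python sketch of that input-output rule follows (token matching only, with no attention weights; the function name is illustrative).

def induction_prediction(tokens):
    """Return the induction-head guess for the next token: find the most
    recent earlier occurrence of the last token and copy its successor.
    Returns None when the last token has not appeared before."""
    last = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):   # scan right-to-left, skipping the final position
        if tokens[i] == last:
            return tokens[i + 1]
    return None

# toy usage: the bigram ("B", "C") recurs, so after "B" the rule predicts "C"
print(induction_prediction(["A", "B", "C", "D", "B"]))   # -> "C"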

In-context learning with transformers: Softmax attention adapts to function Lipschitzness

L Collins, A Parulekar, A Mokhtari, S Sanghavi… - arXiv preprint arXiv …, 2024 - arxiv.org
A striking property of transformers is their ability to perform in-context learning (ICL), a
machine learning framework in which the learner is presented with a novel context during …

Do LLMs dream of elephants (when told not to)? Latent concept association and associative memory in transformers

Y Jiang, G Rajendran, P Ravikumar… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) have the capacity to store and recall facts. Through
experimentation with open-source models, we observe that this ability to retrieve facts can …