Bypassing the exponential dependency: Looped transformers efficiently learn in-context by multi-step gradient descent
In-context learning has been recognized as a key factor in the success of Large Language
Models (LLMs). It refers to the model's ability to learn patterns on the fly from provided in …
How transformers learn causal structure with gradient descent
The incredible success of transformers on sequence modeling tasks can be largely
attributed to the self-attention mechanism, which allows information to be transferred …
GaLore: Memory-efficient LLM training by gradient low-rank projection
Training Large Language Models (LLMs) presents significant memory challenges,
predominantly due to the growing size of weights and optimizer states. Common memory …
Training dynamics of multi-head softmax attention for in-context learning: Emergence, convergence, and optimality
We study the dynamics of gradient flow for training a multi-head softmax attention model for
in-context learning of multi-task linear regression. We establish the global convergence of …
A primer on the inner workings of transformer-based language models
The rapid progress of research aimed at interpreting the inner workings of advanced
language models has highlighted a need for contextualizing the insights gained from years …
An information-theoretic analysis of in-context learning
Previous theoretical results pertaining to meta-learning on sequences build on contrived
assumptions and are somewhat convoluted. We introduce new information-theoretic tools …
Training nonlinear transformers for efficient in-context learning: A theoretical learning and generalization analysis
Transformer-based large language models have displayed impressive in-context learning
capabilities, where a pre-trained model can handle new tasks without fine-tuning by simply …
Unveiling induction heads: Provable training dynamics and feature learning in transformers
In-context learning (ICL) is a cornerstone of large language model (LLM) functionality, yet its
theoretical foundations remain elusive due to the complexity of transformer architectures. In …
In-context learning with transformers: Softmax attention adapts to function Lipschitzness
A striking property of transformers is their ability to perform in-context learning (ICL), a
machine learning framework in which the learner is presented with a novel context during …
Do LLMs dream of elephants (when told not to)? Latent concept association and associative memory in transformers
Large Language Models (LLMs) have the capacity to store and recall facts. Through
experimentation with open-source models, we observe that this ability to retrieve facts can …