Do LLMs dream of elephants (when told not to)? Latent concept association and associative memory in transformers
Large Language Models (LLMs) have the capacity to store and recall facts. Through
experimentation with open-source models, we observe that this ability to retrieve facts can …
Large language models as Markov chains
Large language models (LLMs) have proven to be remarkably efficient, both across a wide
range of natural language processing tasks and well beyond them. However, a …
On the power of convolution augmented transformer
The transformer architecture has catalyzed revolutionary advances in language modeling.
However, recent architectural recipes, such as state-space models, have bridged the …
Optimizing Attention with Mirror Descent: Generalized Max-Margin Token Selection
Attention mechanisms have revolutionized several domains of artificial intelligence, such as
natural language processing and computer vision, by enabling models to selectively focus …
How Do Nonlinear Transformers Acquire Generalization-Guaranteed CoT Ability?
Chain-of-Thought (CoT) is an efficient prompting method that enables the reasoning ability
of large language models by augmenting the query using multiple examples with …
Training Nonlinear Transformers for Chain-of-Thought Inference: A Theoretical Generalization Analysis
Chain-of-Thought (CoT) is an efficient prompting method that enables the reasoning ability
of large language models by augmenting the query using multiple examples with multiple …
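The two Chain-of-Thought entries above describe the same mechanism: the query is augmented with a few worked examples whose answers spell out intermediate reasoning steps before the final answer. The following is a minimal sketch of that prompt construction only; the helper name and the example problems are hypothetical and are not taken from either paper.

# Illustrative CoT prompt construction: prepend worked examples with
# intermediate steps, then ask the model to continue the pattern on a new query.
# All names and example problems below are made up for illustration.

EXAMPLES = [
    {
        "question": "A pen costs 2 dollars and a notebook costs 3 dollars. "
                    "What do 2 pens and 1 notebook cost?",
        "steps": "Two pens cost 2 * 2 = 4 dollars. Adding one notebook gives 4 + 3 = 7 dollars.",
        "answer": "7 dollars",
    },
    {
        "question": "A train travels 60 km per hour for 2.5 hours. How far does it go?",
        "steps": "Distance is speed times time: 60 * 2.5 = 150 km.",
        "answer": "150 km",
    },
]

def build_cot_prompt(query: str) -> str:
    """Augment the query with few-shot examples that include intermediate steps."""
    parts = []
    for ex in EXAMPLES:
        parts.append(
            f"Q: {ex['question']}\n"
            f"A: Let's think step by step. {ex['steps']} The answer is {ex['answer']}."
        )
    # The model is expected to continue the pattern and produce its own steps.
    parts.append(f"Q: {query}\nA: Let's think step by step.")
    return "\n\n".join(parts)

if __name__ == "__main__":
    print(build_cot_prompt("If 3 apples cost 6 dollars, what do 5 apples cost?"))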
Training Dynamics of In-Context Learning in Linear Attention
While attention-based models have demonstrated the remarkable ability of in-context
learning, the theoretical understanding of how these models acquired this ability through …
Transformers Simulate MLE for Sequence Generation in Bayesian Networks
Transformers have achieved significant success in various fields, notably excelling in tasks
involving sequential data like natural language processing. Despite these achievements, the …
Metastable Dynamics of Chain-of-Thought Reasoning: Provable Benefits of Search, RL and Distillation
J Kim, D Wu, J Lee, T Suzuki - arXiv preprint arXiv:2502.01694, 2025 - arxiv.org
A key paradigm to improve the reasoning capabilities of large language models (LLMs) is to
allocate more inference-time compute to search against a verifier or reward model. This …
Local to Global: Learning Dynamics and Effect of Initialization for Transformers
In recent years, transformer-based models have revolutionized deep learning, particularly in
sequence modeling. To better understand this phenomenon, there is a growing interest in …