Do LLMs dream of elephants (when told not to)? Latent concept association and associative memory in transformers

Y Jiang, G Rajendran, P Ravikumar… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) have the capacity to store and recall facts. Through
experimentation with open-source models, we observe that this ability to retrieve facts can …

Large language models as markov chains

O Zekri, A Odonnat, A Benechehab, L Bleistein… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) have proven to be remarkably efficient, both across a wide
range of natural language processing tasks and well beyond them. However, a …

On the power of convolution augmented transformer

M Li, X Zhang, Y Huang, S Oymak - arXiv preprint arXiv:2407.05591, 2024 - arxiv.org
The transformer architecture has catalyzed revolutionary advances in language modeling.
However, recent architectural recipes, such as state-space models, have bridged the …

Optimizing Attention with Mirror Descent: Generalized Max-Margin Token Selection

AAK Julistiono, DA Tarzanagh, N Azizan - arXiv preprint arXiv:2410.14581, 2024 - arxiv.org
Attention mechanisms have revolutionized several domains of artificial intelligence, such as
natural language processing and computer vision, by enabling models to selectively focus …

How Do Nonlinear Transformers Acquire Generalization-Guaranteed CoT Ability?

H Li, M Wang, S Lu, X Cui, PY Chen - High-dimensional Learning …, 2024 - openreview.net
Chain-of-Thought (CoT) is an efficient prompting method that enables the reasoning ability
of large language models by augmenting the query using multiple examples with …

Training Nonlinear Transformers for Chain-of-Thought Inference: A Theoretical Generalization Analysis

H Li, M Wang, S Lu, X Cui, PY Chen - arXiv preprint arXiv:2410.02167, 2024 - arxiv.org
Chain-of-Thought (CoT) is an efficient prompting method that enables the reasoning ability
of large language models by augmenting the query using multiple examples with multiple …

Training Dynamics of In-Context Learning in Linear Attention

Y Zhang, AK Singh, PE Latham, A Saxe - arXiv preprint arXiv:2501.16265, 2025 - arxiv.org
While attention-based models have demonstrated the remarkable ability of in-context
learning, the theoretical understanding of how these models acquire this ability through …

Transformers Simulate MLE for Sequence Generation in Bayesian Networks

Y Cao, Y He, D Wu, HY Chen, J Fan, H Liu - arXiv preprint arXiv …, 2025 - arxiv.org
Transformers have achieved significant success in various fields, notably excelling in tasks
involving sequential data like natural language processing. Despite these achievements, the …

Metastable Dynamics of Chain-of-Thought Reasoning: Provable Benefits of Search, RL and Distillation

J Kim, D Wu, J Lee, T Suzuki - arXiv preprint arXiv:2502.01694, 2025 - arxiv.org
A key paradigm to improve the reasoning capabilities of large language models (LLMs) is to
allocate more inference-time compute to search against a verifier or reward model. This …

Local to Global: Learning Dynamics and Effect of Initialization for Transformers

AV Makkuva, M Bondaschi, C Ekbote, A Girish… - arXiv preprint arXiv …, 2024 - arxiv.org
In recent years, transformer-based models have revolutionized deep learning, particularly in
sequence modeling. To better understand this phenomenon, there is a growing interest in …