Mechanistic Interpretability for AI Safety – A Review

L Bereska, E Gavves - arXiv preprint arXiv:2404.14082, 2024 - arxiv.org
Understanding AI systems' inner workings is critical for ensuring value alignment and safety.
This review explores mechanistic interpretability: reverse engineering the computational …

A practical review of mechanistic interpretability for transformer-based language models

D Rai, Y Zhou, S Feng, A Saparov, Z Yao - arXiv preprint arXiv …, 2024 - arxiv.org
Mechanistic interpretability (MI) is an emerging sub-field of interpretability that seeks to
understand a neural network model by reverse-engineering its internal computations …
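
The core experimental move in this literature is causal intervention on internal activations. Below is a minimal sketch of activation patching on a toy PyTorch module (module and variable names are illustrative, not from the paper): cache an activation from a clean run, splice it into a corrupted run, and measure the change in output.

    import torch

    torch.manual_seed(0)
    model = torch.nn.Sequential(torch.nn.Linear(8, 16), torch.nn.ReLU(),
                                torch.nn.Linear(16, 1))
    x_clean, x_corrupt = torch.randn(1, 8), torch.randn(1, 8)

    cache = {}
    def cache_hook(mod, inp, out):
        cache["h"] = out.detach()          # store the clean activation

    def patch_hook(mod, inp, out):
        return cache["h"]                  # overwrite with the cached clean value

    with torch.no_grad():
        h = model[0].register_forward_hook(cache_hook)
        model(x_clean)                     # clean run: fill the cache
        h.remove()
        baseline = model(x_corrupt)        # corrupted run, no intervention
        h = model[0].register_forward_hook(patch_hook)
        patched = model(x_corrupt)         # corrupted run with clean activation
        h.remove()

    print(f"effect of patching layer 0: {(patched - baseline).item():.4f}")

A large effect indicates the patched component carries information that matters for the output on this input pair.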

ContextCite: Attributing model generation to context

B Cohen-Wang, H Shah… - Advances in Neural …, 2025 - proceedings.neurips.cc
How do language models use information provided as context when generating a
response? Can we infer whether a particular generated statement is actually grounded in …
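
ContextCite's actual estimator fits a linear surrogate over many random context ablations; the sketch below is a simpler leave-one-out variant that conveys the same idea. The logprob argument is a hypothetical callable returning the model's log-probability of the statement given a list of context sources.

    def attribute(context_sources, statement, logprob):
        """Score each context source by how much removing it hurts the statement."""
        base = logprob(context_sources, statement)
        scores = []
        for i in range(len(context_sources)):
            ablated = context_sources[:i] + context_sources[i + 1:]
            # Drop in log-probability when source i is removed
            scores.append(base - logprob(ablated, statement))
        return scores

High scores flag the context sources a generated statement is plausibly grounded in.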

Attribution patching outperforms automated circuit discovery

A Syed, C Rager, A Conmy - arXiv preprint arXiv:2310.10348, 2023 - arxiv.org
Automated interpretability research has recently attracted attention as a potential research
direction that could scale explanations of neural network behavior to large models. Existing …
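
Attribution patching replaces one forward pass per patched component with a single backward pass: the effect of patching is approximated to first order as (clean activation − corrupted activation) · gradient of the metric. A minimal sketch on a toy module, with illustrative names:

    import torch

    torch.manual_seed(0)
    model = torch.nn.Sequential(torch.nn.Linear(8, 16), torch.nn.ReLU(),
                                torch.nn.Linear(16, 1))
    x_clean, x_corrupt = torch.randn(1, 8), torch.randn(1, 8)

    acts = {}
    def save(mod, inp, out):
        acts["mlp0"] = out
        if out.requires_grad:
            out.retain_grad()              # keep the gradient for attribution

    h = model[0].register_forward_hook(save)

    with torch.no_grad():                  # clean run: cache activations only
        model(x_clean)
    a_clean = acts["mlp0"]

    metric = model(x_corrupt).sum()        # corrupted run, with gradients
    metric.backward()
    a_corrupt, grad = acts["mlp0"].detach(), acts["mlp0"].grad
    h.remove()

    # First-order estimate of each unit's patching effect
    attribution = ((a_clean - a_corrupt) * grad).squeeze(0)
    print(attribution.topk(3).indices)     # units predicted to matter most

All components are scored from one clean and one corrupted pass, which is what makes the method cheap enough to scale past exhaustive patching.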

Information flow routes: Automatically interpreting language models at scale

J Ferrando, E Voita - arXiv preprint arXiv:2403.00824, 2024 - arxiv.org
Information flows along routes inside the network via mechanisms implemented in the model.
These routes can be represented as graphs where nodes correspond to token …
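
These graphs can be extracted cheaply because edge importance comes from a forward pass rather than from patching experiments. A toy sketch of the graph-building step, scoring edges with raw attention weights as a stand-in for the paper's contribution-based importance measure (threshold and shapes are illustrative):

    import torch

    def routes_from_attention(attn, tau=0.1):
        # attn: (n_layers, n_heads, seq, seq) attention weights
        n_layers, _, seq, _ = attn.shape
        edges = []
        for layer in range(n_layers):
            w = attn[layer].mean(dim=0)        # aggregate over heads
            for dst in range(seq):
                for src in range(dst + 1):     # respect the causal mask
                    if w[dst, src] > tau:
                        edges.append((layer, src, dst, float(w[dst, src])))
        return edges

    attn = torch.softmax(torch.randn(2, 4, 5, 5), dim=-1)
    print(len(routes_from_attention(attn)), "edges kept")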

Decomposing and editing predictions by modeling model computation

H Shah, A Ilyas, A Madry - arXiv preprint arXiv:2404.11534, 2024 - arxiv.org
How does the internal computation of a machine learning model transform inputs into
predictions? In this paper, we introduce a task called component modeling that aims to …
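
One way to read the idea: fit a surrogate that predicts the model's output from a binary mask over its components, so the learned weights attribute the prediction to individual components. The sketch below uses a plain least-squares fit over random ablation masks; run_ablated is a hypothetical callable that evaluates the model with the masked-out components ablated.

    import torch

    def fit_component_attribution(run_ablated, n_components, n_samples=512):
        # Random binary masks: 1 = keep component, 0 = ablate it
        masks = (torch.rand(n_samples, n_components) > 0.5).float()
        outs = torch.tensor([float(run_ablated(m)) for m in masks])
        # Linear surrogate: outs ≈ masks @ w + b, solved by least squares
        X = torch.cat([masks, torch.ones(n_samples, 1)], dim=1)
        sol = torch.linalg.lstsq(X, outs.unsqueeze(1)).solution.squeeze(1)
        return sol[:-1], sol[-1]               # per-component weights, bias

Components with large weights are those whose ablation most changes the output, which also suggests where to intervene to edit a prediction.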

AI and the Problem of Knowledge Collapse

AJ Peterson - AI & Society, 2025 - Springer
While artificial intelligence has the potential to process vast amounts of data, generate new
insights, and unlock greater productivity, its widespread adoption may entail unforeseen …

Improving sparse decomposition of language model activations with gated sparse autoencoders

S Rajamanoharan, A Conmy, L Smith… - Advances in …, 2025 - proceedings.neurips.cc
Recent work has found that sparse autoencoders (SAEs) are an effective technique for
unsupervised discovery of interpretable features in language models' (LMs) activations, by …
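
The gated variant decouples two jobs an ordinary ReLU SAE forces a single encoder to do: deciding which features fire and estimating how strongly. A sketch of the forward pass, paraphrasing the paper's architecture (a Heaviside gate plus a magnitude path sharing encoder directions via a per-feature rescale); dimensions are illustrative:

    import torch
    import torch.nn as nn

    class GatedSAE(nn.Module):
        def __init__(self, d_model, d_sae):
            super().__init__()
            self.W_enc = nn.Parameter(torch.randn(d_model, d_sae) * 0.02)
            self.W_dec = nn.Parameter(torch.randn(d_sae, d_model) * 0.02)
            self.r_mag = nn.Parameter(torch.zeros(d_sae))   # per-feature rescale
            self.b_gate = nn.Parameter(torch.zeros(d_sae))
            self.b_mag = nn.Parameter(torch.zeros(d_sae))
            self.b_dec = nn.Parameter(torch.zeros(d_model))

        def forward(self, x):
            x_cent = x - self.b_dec
            gate = ((x_cent @ self.W_enc + self.b_gate) > 0).float()  # which features fire
            mag = torch.relu(x_cent @ (self.W_enc * torch.exp(self.r_mag))
                             + self.b_mag)                            # how strongly
            return (gate * mag) @ self.W_dec + self.b_dec             # reconstruction

    sae = GatedSAE(d_model=64, d_sae=512)
    x_hat = sae(torch.randn(8, 64))

Training losses (reconstruction plus a sparsity penalty on the gate path) are omitted in this sketch.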

LLM circuit analyses are consistent across training and scale

C Tigges, M Hanna, Q Yu, S Biderman - arXiv preprint arXiv:2407.10827, 2024 - arxiv.org
Most currently deployed large language models (LLMs) undergo continuous training or
additional finetuning. By contrast, most research into LLMs' internal mechanisms focuses on …

Answer, assemble, ace: Understanding how transformers answer multiple choice questions

S Wiegreffe, O Tafjord, Y Belinkov, H Hajishirzi… - arXiv preprint arXiv …, 2024 - arxiv.org
Multiple-choice question answering (MCQA) is a key competence of performant transformer
language models that is tested by mainstream benchmarks. However, recent evidence …