Mechanistic Interpretability for AI Safety--A Review

L Bereska, E Gavves - arXiv preprint arXiv:2404.14082, 2024 - arxiv.org
Understanding AI systems' inner workings is critical for ensuring value alignment and safety.
This review explores mechanistic interpretability: reverse engineering the computational …

Gemma Scope: Open sparse autoencoders everywhere all at once on Gemma 2

T Lieberum, S Rajamanoharan, A Conmy… - arXiv preprint arXiv …, 2024 - arxiv.org
Sparse autoencoders (SAEs) are an unsupervised method for learning a sparse
decomposition of a neural network's latent representations into seemingly interpretable …
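
For readers new to the technique, a minimal sketch of the kind of sparse autoencoder these entries refer to is given below: a linear encoder/decoder pair trained to reconstruct a model's activation vectors under a sparsity penalty. The plain ReLU activation and L1 penalty are illustrative assumptions, not necessarily the specific recipe used in Gemma Scope.

```python
# Minimal sparse autoencoder (SAE) sketch: decompose activation vectors into a
# wider, sparsely active feature basis and reconstruct them. ReLU + L1 sparsity
# are illustrative choices, not any specific paper's training recipe.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)  # activation -> feature coefficients
        self.decoder = nn.Linear(d_dict, d_model)  # feature coefficients -> reconstruction

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))  # non-negative, (ideally) sparse feature activations
        return self.decoder(f), f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that encourages sparse feature use.
    return ((x - x_hat) ** 2).mean() + l1_coeff * f.abs().mean()
```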

A survey on uncertainty quantification of large language models: Taxonomy, open research challenges, and future directions

O Shorinwa, Z Mei, J Lidard, AZ Ren… - arXiv preprint arXiv …, 2024 - arxiv.org
The remarkable performance of large language models (LLMs) in content generation,
coding, and common-sense reasoning has spurred widespread integration into many facets …

Mechanistic?

N Saphra, S Wiegreffe - arXiv preprint arXiv:2410.09087, 2024 - arxiv.org
The rise of the term "mechanistic interpretability" has accompanied increasing interest in
understanding neural models--particularly language models. However, this jargon has also …

Open Problems in Mechanistic Interpretability

L Sharkey, B Chughtai, J Batson, J Lindsey… - arXiv preprint arXiv …, 2025 - arxiv.org
Mechanistic interpretability aims to understand the computational mechanisms underlying
neural networks' capabilities in order to accomplish concrete scientific and engineering …

Partially Rewriting a Transformer in Natural Language

G Paulo, N Belrose - arXiv preprint arXiv:2501.18838, 2025 - arxiv.org
The greatest ambition of mechanistic interpretability is to completely rewrite deep neural
networks in a format that is more amenable to human understanding, while preserving their …

Perspectives for Direct Interpretability in Multi-Agent Deep Reinforcement Learning

Y Poupart, A Beynier, N Maudet - arXiv preprint arXiv:2502.00726, 2025 - arxiv.org
Multi-Agent Deep Reinforcement Learning (MADRL) has proven effective at solving complex
problems in robotics and games, yet most trained models are hard to interpret. While …

Sparse Autoencoders Do Not Find Canonical Units of Analysis

P Leask, B Bussmann, M Pearce, J Bloom… - arXiv preprint arXiv …, 2025 - arxiv.org
A common goal of mechanistic interpretability is to decompose the activations of neural
networks into features: interpretable properties of the input computed by the model. Sparse …

Universal Response and Emergence of Induction in LLMs

N Luick - arXiv preprint arXiv:2411.07071, 2024 - arxiv.org
While induction is considered a key mechanism for in-context learning in LLMs,
understanding its precise circuit decomposition beyond toy models remains elusive. Here …
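
As a hedged illustration of what "induction" means operationally, the sketch below scores attention heads for the classic induction pattern on a repeated random sequence (attending from a token back to the token that followed its previous occurrence); the paper's own measurements may be defined differently.

```python
# Hedged sketch: score each attention head for induction-like behavior on a
# sequence made of one random block repeated twice. At the second occurrence of
# a token, an induction head attends to the token just after its first
# occurrence, i.e. to the position exactly (seq_len - 1) steps earlier.
import torch

def induction_scores(attn: torch.Tensor, seq_len: int) -> torch.Tensor:
    """attn: (n_heads, 2*seq_len, 2*seq_len) attention pattern on the repeated block."""
    query_pos = torch.arange(seq_len, 2 * seq_len)    # positions in the second block
    key_pos = query_pos - (seq_len - 1)               # token after the first occurrence
    return attn[:, query_pos, key_pos].mean(dim=-1)   # average induction attention per head
```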

Analyze Feature Flow to Enhance Interpretation and Steering in Language Models

D Laptev, N Balagansky, Y Aksenov… - arXiv preprint arXiv …, 2025 - arxiv.org
We introduce a new approach to systematically map features discovered by sparse
autoencoders across consecutive layers of large language models, extending earlier work …
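
One simple way to relate SAE features across adjacent layers, in the spirit of the approach described above, is to compare decoder directions in the shared residual-stream basis; the cosine-similarity matching sketched below is an illustrative heuristic, not necessarily the paper's actual procedure.

```python
# Hedged sketch: match SAE features between consecutive layers by the cosine
# similarity of their decoder directions, which live in the same residual-stream
# space. Illustrative heuristic only; the paper's method may differ.
import torch
import torch.nn.functional as F

def match_features(decoder_l: torch.Tensor, decoder_next: torch.Tensor, top_k: int = 1):
    """decoder_l: (d_model, n_features_l); decoder_next: (d_model, n_features_next)."""
    a = F.normalize(decoder_l, dim=0)      # unit-norm feature directions, layer l
    b = F.normalize(decoder_next, dim=0)   # unit-norm feature directions, layer l+1
    sims = a.T @ b                         # (n_features_l, n_features_next) cosines
    scores, idx = sims.topk(top_k, dim=1)  # best next-layer match(es) for each feature
    return scores, idx
```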