Mechanistic Interpretability for AI Safety--A Review

L Bereska, E Gavves - arXiv preprint arXiv:2404.14082, 2024 - arxiv.org
Understanding AI systems' inner workings is critical for ensuring value alignment and safety.
This review explores mechanistic interpretability: reverse engineering the computational …

Scaling and evaluating sparse autoencoders

L Gao, TD la Tour, H Tillman, G Goh, R Troll… - arXiv preprint arXiv …, 2024 - arxiv.org
Sparse autoencoders provide a promising unsupervised approach for extracting
interpretable features from a language model by reconstructing activations from a sparse …
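The following is a minimal sketch of the kind of sparse autoencoder described in this snippet, assuming a standard single-hidden-layer design with a ReLU latent and an L1 sparsity penalty; the layer sizes, penalty coefficient, and random batch are illustrative, not the paper's actual setup.

```python
# Minimal sparse autoencoder (SAE) sketch: reconstruct model activations
# from a sparse, overcomplete latent code. Sizes are toy values.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)   # activations -> latent code
        self.decoder = nn.Linear(d_hidden, d_model)   # latent code -> reconstruction

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))               # sparse, non-negative features
        x_hat = self.decoder(f)                       # reconstructed activations
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that pushes most features to zero.
    return ((x - x_hat) ** 2).mean() + l1_coeff * f.abs().mean()

# Toy usage: decompose a batch of stand-in residual-stream activations.
sae = SparseAutoencoder(d_model=512, d_hidden=4096)
x = torch.randn(32, 512)
x_hat, f = sae(x)
loss = sae_loss(x, x_hat, f)
```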

Gemma Scope: Open sparse autoencoders everywhere all at once on Gemma 2

T Lieberum, S Rajamanoharan, A Conmy… - arXiv preprint arXiv …, 2024 - arxiv.org
Sparse autoencoders (SAEs) are an unsupervised method for learning a sparse
decomposition of a neural network's latent representations into seemingly interpretable …

Towards principled evaluations of sparse autoencoders for interpretability and control

A Makelov, G Lange, N Nanda - arXiv preprint arXiv:2405.08366, 2024 - arxiv.org
Disentangling model activations into meaningful features is a central problem in
interpretability. However, the absence of ground-truth for these features in realistic scenarios …

A primer on the inner workings of transformer-based language models

J Ferrando, G Sarti, A Bisazza, M Costa-jussà - 2024 - research.rug.nl
The rapid progress of research aimed at interpreting the inner workings of advanced
language models has highlighted a need for contextualizing the insights gained from years …

Relational composition in neural networks: A survey and call to action

M Wattenberg, FB Viégas - arXiv preprint arXiv:2407.14662, 2024 - arxiv.org
Many neural nets appear to represent data as linear combinations of "feature vectors."
Algorithms for discovering these vectors have seen impressive recent success. However, we …
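A toy NumPy illustration of the "linear combinations of feature vectors" picture this snippet refers to; the feature dictionary, coefficients, and dimensions are invented for the example, not taken from the paper.

```python
# Represent an activation as a weighted sum of a small dictionary of
# feature directions, then read the weights back out by projection.
import numpy as np

rng = np.random.default_rng(0)
d = 8                                       # activation dimension (illustrative)
Q, _ = np.linalg.qr(rng.standard_normal((d, 3)))
features = Q.T                              # three orthonormal feature directions, shape (3, d)

coeffs = np.array([2.0, 0.5, 0.0])          # how strongly each feature is "present"
activation = coeffs @ features              # x = sum_i a_i * f_i

# Projection recovers the coefficients exactly here because the
# feature directions are orthonormal.
recovered = features @ activation
print(np.round(recovered, 3))               # ~ [2.0, 0.5, 0.0]
```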

Mathematical models of computation in superposition

K Hänni, J Mendel, D Vaintrob, L Chan - arXiv preprint arXiv:2408.05451, 2024 - arxiv.org
Superposition--when a neural network represents more "features" than it has dimensions--
seems to pose a serious challenge to mechanistically interpreting current AI systems …
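A small numerical sketch of the superposition setting described here: many nearly orthogonal feature directions packed into a lower-dimensional space, at the cost of some interference. The dimensions and random directions are arbitrary choices for illustration.

```python
# Pack n feature directions into a d-dimensional space with n > d and
# measure the resulting pairwise interference (off-diagonal dot products).
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 512                              # 512 "features" in a 64-dimensional space
W = rng.standard_normal((n, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)

gram = W @ W.T                              # pairwise dot products between features
off_diag = gram[~np.eye(n, dtype=bool)]     # exclude each feature's self-overlap

# Interference is small but nonzero: features share the space with crosstalk.
print(f"max |interference|  = {np.abs(off_diag).max():.3f}")
print(f"mean |interference| = {np.abs(off_diag).mean():.3f}")
```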

Sparse autoencoders reveal universal feature spaces across large language models

M Lan, P Torr, A Meek, A Khakzar, D Krueger… - arXiv preprint arXiv …, 2024 - arxiv.org
We investigate feature universality in large language models (LLMs), a research field that
aims to understand how different models similarly represent concepts in the latent spaces of …

Disentangling dense embeddings with sparse autoencoders

C O'Neill, C Ye, K Iyer, JF Wu - arXiv preprint arXiv:2408.00657, 2024 - arxiv.org
Sparse autoencoders (SAEs) have shown promise in extracting interpretable features from
complex neural networks. We present one of the first applications of SAEs to dense text …

Explaining AI through mechanistic interpretability

L Kästner, B Crook - European Journal for Philosophy of Science, 2024 - Springer
Recent work in explainable artificial intelligence (XAI) attempts to render opaque AI systems
understandable through a divide-and-conquer strategy. However, this fails to illuminate how …