Partially Rewriting a Transformer in Natural Language

G Paulo, N Belrose - arXiv preprint arXiv:2501.18838, 2025 - arxiv.org
The greatest ambition of mechanistic interpretability is to completely rewrite deep neural
networks in a format that is more amenable to human understanding, while preserving their …

Enhancing Automated Interpretability with Output-Centric Feature Descriptions

Y Gur-Arieh, R Mayan, C Agassy, A Geiger… - arXiv preprint arXiv …, 2025 - arxiv.org
Automated interpretability pipelines generate natural language descriptions for the concepts
represented by features in large language models (LLMs), such as plants or the first word in …
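
One common output-centric idea is to describe a feature by the output tokens its direction promotes. Below is a minimal sketch of that idea, projecting a feature's decoder direction through the model's unembedding matrix; the names and shapes (w_dec, W_U) are illustrative assumptions, not this paper's actual pipeline:

import torch

def top_output_tokens(w_dec, W_U, tokenizer, feature_idx, k=10):
    # w_dec: (d_sae, d_model) feature decoder matrix (assumed shape);
    # W_U: (d_model, vocab_size) unembedding matrix of the LLM.
    direction = w_dec[feature_idx]   # the feature's write direction
    logits = direction @ W_U         # score every vocabulary token
    top = torch.topk(logits, k).indices
    return [tokenizer.decode([t]) for t in top]

The resulting token list can then be summarized by an explainer model into a natural-language description of the feature.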

Sparse Autoencoders Trained on the Same Data Learn Different Features

G Paulo, N Belrose - arXiv preprint arXiv:2501.16615, 2025 - arxiv.org
Sparse autoencoders (SAEs) are a useful tool for uncovering human-interpretable features
in the activations of large language models (LLMs). While some expect SAEs to find the true …
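
For context, a minimal sketch of the kind of SAE these papers train: an overcomplete linear encoder/decoder with an activation-sparsity penalty. The ReLU nonlinearity and L1 coefficient are common choices, not details taken from this paper:

import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Reconstruct LLM activations through an overcomplete, sparsely
    activated hidden layer whose units are the candidate 'features'."""
    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_sae)
        self.decoder = nn.Linear(d_sae, d_model)

    def forward(self, x):
        f = torch.relu(self.encoder(x))   # sparse feature activations
        x_hat = self.decoder(f)           # reconstruction of the input
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty that drives most
    # feature activations to zero.
    return ((x - x_hat) ** 2).mean() + l1_coeff * f.abs().mean()

Because this objective has many near-equivalent minima, two SAEs trained on the same activations can settle on different feature dictionaries, the non-identifiability the title refers to.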

Propositional interpretability in artificial intelligence

DJ Chalmers - arXiv preprint arXiv:2501.15740, 2025 - arxiv.org
Mechanistic interpretability is the program of explaining what AI systems are doing in terms
of their internal mechanisms. I analyze some aspects of the program, along with setting out …

Sparse Autoencoders Can Interpret Randomly Initialized Transformers

T Heap, T Lawson, L Farnik, L Aitchison - arXiv preprint arXiv:2501.17727, 2025 - arxiv.org
Sparse autoencoders (SAEs) are an increasingly popular technique for interpreting the
internal representations of transformers. In this paper, we apply SAEs to 'interpret' random …

Steering Large Language Models with Feature Guided Activation Additions

S Soo, W Teng, C Balaganesh - arXiv preprint arXiv:2501.09929, 2025 - arxiv.org
Effective and reliable control over large language model (LLM) behavior is a significant
challenge. While activation steering methods, which add steering vectors to a model's …
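
A minimal sketch of plain activation addition, the baseline such methods build on: a scaled steering vector is added to the residual stream of one layer during the forward pass. The module path assumes a Hugging Face GPT-2-style model tree; in feature-guided variants the vector would instead be derived from an interpretable feature direction:

import torch

def add_steering_hook(model, layer: int, steering_vector: torch.Tensor,
                      alpha: float = 5.0):
    """Register a forward hook that adds alpha * steering_vector to the
    output of one transformer block (assumes model.transformer.h[layer],
    as in Hugging Face GPT-2)."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * steering_vector.to(hidden)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return model.transformer.h[layer].register_forward_hook(hook)

# The returned handle can be removed with handle.remove() once
# steering is no longer wanted.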