Partially Rewriting a Transformer in Natural Language
The greatest ambition of mechanistic interpretability is to completely rewrite deep neural
networks in a format that is more amenable to human understanding, while preserving their …
Enhancing Automated Interpretability with Output-Centric Feature Descriptions
Automated interpretability pipelines generate natural language descriptions for the concepts
represented by features in large language models (LLMs), such as plants or the first word in …
Sparse Autoencoders Trained on the Same Data Learn Different Features
Sparse autoencoders (SAEs) are a useful tool for uncovering human-interpretable features
in the activations of large language models (LLMs). While some expect SAEs to find the true …
Propositional interpretability in artificial intelligence
DJ Chalmers - arXiv preprint arXiv:2501.15740, 2025 - arxiv.org
Mechanistic interpretability is the program of explaining what AI systems are doing in terms
of their internal mechanisms. I analyze some aspects of the program, along with setting out …
Sparse Autoencoders Can Interpret Randomly Initialized Transformers
Sparse autoencoders (SAEs) are an increasingly popular technique for interpreting the
internal representations of transformers. In this paper, we apply SAEs to 'interpret' random …
Steering Large Language Models with Feature Guided Activation Additions
S Soo, W Teng, C Balaganesh - arXiv preprint arXiv:2501.09929, 2025 - arxiv.org
Effective and reliable control over large language model (LLM) behavior is a significant
challenge. While activation steering methods, which add steering vectors to a model's …