Partially Rewriting a Transformer in Natural Language

G Paulo, N Belrose - arXiv preprint arXiv:2501.18838, 2025 - arxiv.org
The greatest ambition of mechanistic interpretability is to completely rewrite deep neural
networks in a format that is more amenable to human understanding, while preserving their …

Enhancing Automated Interpretability with Output-Centric Feature Descriptions

Y Gur-Arieh, R Mayan, C Agassy, A Geiger… - arXiv preprint arXiv …, 2025 - arxiv.org
Automated interpretability pipelines generate natural language descriptions for the concepts
represented by features in large language models (LLMs), such as plants or the first word in …
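
One common output-centric idea is to describe a feature by the output tokens its direction promotes. Below is a minimal sketch of that idea, projecting a feature's decoder direction through the model's unembedding matrix; the names and shapes (w_dec, W_U) are illustrative assumptions, not this paper's actual pipeline:

import torch

def top_output_tokens(w_dec, W_U, tokenizer, feature_idx, k=10):
    # w_dec: (d_sae, d_model) feature decoder matrix (assumed shape);
    # W_U: (d_model, vocab_size) unembedding matrix of the LLM.
    direction = w_dec[feature_idx]   # the feature's write direction
    logits = direction @ W_U         # score every vocabulary token
    top = torch.topk(logits, k).indices
    return [tokenizer.decode([t]) for t in top]

The resulting token list can then be summarized by an explainer model into a natural-language description of the feature.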

Sparse Autoencoders Trained on the Same Data Learn Different Features

G Paulo, N Belrose - arXiv preprint arXiv:2501.16615, 2025 - arxiv.org
Sparse autoencoders (SAEs) are a useful tool for uncovering human-interpretable features
in the activations of large language models (LLMs). While some expect SAEs to find the true …
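
For context, a minimal sketch of the kind of SAE these papers train: an overcomplete linear encoder/decoder with an activation-sparsity penalty. The ReLU nonlinearity and L1 coefficient are common choices, not details taken from this paper:

import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Reconstruct LLM activations through an overcomplete, sparsely
    activated hidden layer whose units are the candidate 'features'."""
    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_sae)
        self.decoder = nn.Linear(d_sae, d_model)

    def forward(self, x):
        f = torch.relu(self.encoder(x))   # sparse feature activations
        x_hat = self.decoder(f)           # reconstruction of the input
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty that drives most
    # feature activations to zero.
    return ((x - x_hat) ** 2).mean() + l1_coeff * f.abs().mean()

Because this objective has many near-equivalent minima, two SAEs trained on the same activations can settle on different feature dictionaries, the non-identifiability the title refers to.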

Propositional interpretability in artificial intelligence

DJ Chalmers - arXiv preprint arXiv:2501.15740, 2025 - arxiv.org
Mechanistic interpretability is the program of explaining what AI systems are doing in terms
of their internal mechanisms. I analyze some aspects of the program, along with setting out …

Sparse Autoencoders Can Interpret Randomly Initialized Transformers

T Heap, T Lawson, L Farnik, L Aitchison - arXiv preprint arXiv:2501.17727, 2025 - arxiv.org
Sparse autoencoders (SAEs) are an increasingly popular technique for interpreting the
internal representations of transformers. In this paper, we apply SAEs to 'interpret' random …

Steering Large Language Models with Feature Guided Activation Additions

S Soo, W Teng, C Balaganesh - arXiv preprint arXiv:2501.09929, 2025 - arxiv.org
Effective and reliable control over large language model (LLM) behavior is a significant
challenge. While activation steering methods, which add steering vectors to a model's …
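
A minimal sketch of plain activation addition, the baseline such methods build on: a scaled steering vector is added to the residual stream of one layer during the forward pass. The module path assumes a Hugging Face GPT-2-style model tree; in feature-guided variants the vector would instead be derived from an interpretable feature direction:

import torch

def add_steering_hook(model, layer: int, steering_vector: torch.Tensor,
                      alpha: float = 5.0):
    """Register a forward hook that adds alpha * steering_vector to the
    output of one transformer block (assumes model.transformer.h[layer],
    as in Hugging Face GPT-2)."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * steering_vector.to(hidden)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return model.transformer.h[layer].register_forward_hook(hook)

# The returned handle can be removed with handle.remove() once
# steering is no longer wanted.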