Who's asking? User personas and the mechanics of latent misalignment

A Ghandeharioun, A Yuan, M Guerard… - Advances in …, 2025 - proceedings.neurips.cc
Studies show that safety-tuned models may nevertheless divulge harmful information. In this
work, we show that whether they do so depends significantly on who they are talking to …

Unpacking SDXL Turbo: Interpreting text-to-image models with sparse autoencoders

V Surkov, C Wendler, M Terekhov… - arXiv preprint arXiv …, 2024 - arxiv.org
Sparse autoencoders (SAEs) have become a core ingredient in the reverse engineering of
large language models (LLMs). For LLMs, they have been shown to decompose …
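For orientation, below is a minimal sparse-autoencoder sketch of the kind this line of work builds on: a linear encoder with a ReLU and an L1 penalty that decomposes model activations into sparse features. The dimensions and the penalty coefficient are illustrative assumptions, not the paper's configuration.

```python
# Minimal SAE sketch (illustrative only, not the paper's implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)   # activations -> sparse features
        self.decoder = nn.Linear(d_hidden, d_model)   # sparse features -> reconstruction

    def forward(self, x: torch.Tensor):
        f = F.relu(self.encoder(x))      # non-negative feature activations
        x_hat = self.decoder(f)          # reconstruct the original activation
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that encourages sparse features.
    return F.mse_loss(x_hat, x) + l1_coeff * f.abs().mean()

# Usage: fit on a batch of cached activations (shape [batch, d_model]).
sae = SparseAutoencoder(d_model=768, d_hidden=768 * 8)
x = torch.randn(32, 768)                 # placeholder activations
x_hat, f = sae(x)
loss = sae_loss(x, x_hat, f)
loss.backward()
```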

Open Problems in Mechanistic Interpretability

L Sharkey, B Chughtai, J Batson, J Lindsey… - arXiv preprint arXiv …, 2025 - arxiv.org
Mechanistic interpretability aims to understand the computational mechanisms underlying
neural networks' capabilities in order to accomplish concrete scientific and engineering …

Separating Tongue from Thought: Activation Patching Reveals Language-Agnostic Concept Representations in Transformers

C Dumas, C Wendler, V Veselovsky, G Monea… - arXiv preprint arXiv …, 2024 - arxiv.org
A central question in multilingual language modeling is whether large language models
(LLMs) develop a universal concept representation, disentangled from specific languages …
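As a rough illustration of activation patching, the hedged sketch below caches one transformer block's output on a source prompt and splices it into a run on a target prompt via forward hooks. The model (gpt2), layer index, and prompts are placeholder assumptions and do not reproduce the paper's setup.

```python
# Generic activation-patching sketch with forward hooks (assumed setup).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
LAYER = 6  # which transformer block's output to patch (assumption)

cache = {}

def _hidden(output):
    # GPT-2 blocks may return a tuple; hidden states are the first element.
    return output[0] if isinstance(output, tuple) else output

def save_hook(module, inputs, output):
    cache["h"] = _hidden(output).detach()

def patch_hook(module, inputs, output):
    hidden = _hidden(output).clone()
    # Overwrite the block output at the last token position with the cached one.
    hidden[:, -1, :] = cache["h"][:, -1, :]
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

block = model.transformer.h[LAYER]

# 1) Run the "source" prompt and cache the activation.
handle = block.register_forward_hook(save_hook)
with torch.no_grad():
    model(**tok("The capital of France is", return_tensors="pt"))
handle.remove()

# 2) Run the "target" prompt with the cached activation patched in.
handle = block.register_forward_hook(patch_hook)
with torch.no_grad():
    logits = model(**tok("The capital of Italy is", return_tensors="pt")).logits
handle.remove()

print(tok.decode(logits[0, -1].argmax().item()))  # inspect how the patch shifts the prediction
```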

Self-Control of LLM Behaviors by Compressing Suffix Gradient into Prefix Controller

M Cai, Y Zhang, S Zhang, F Yin, D Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
We propose SelfControl, an inference-time model control method utilizing gradients to
control the behavior of large language models (LLMs) without explicit human annotations …
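The sketch below illustrates only the generic idea of gradient-based activation steering (take the gradient of a behavior score with respect to a hidden state and step along it); the scorer, model head, and steering strength are toy placeholders, and this is not the SelfControl prefix-controller method itself.

```python
# Toy sketch of gradient-based activation steering (not the SelfControl method).
import torch
import torch.nn as nn

torch.manual_seed(0)
d = 16
model_head = nn.Linear(d, 5)        # toy stand-in for a language-model head
behavior_scorer = nn.Linear(d, 1)   # toy proxy for a desired-behavior score

h = torch.randn(1, d, requires_grad=True)   # hidden state at some layer

# Score the current hidden state, then take a gradient step on the state itself.
score = behavior_scorer(h).sum()
(grad,) = torch.autograd.grad(score, h)
alpha = 0.5                          # steering strength (assumption)
h_steered = h + alpha * grad

# Compare next-token logits before and after steering.
print(model_head(h).detach())
print(model_head(h_steered).detach())
```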

LatentQA: Teaching LLMs to Decode Activations Into Natural Language

A Pan, L Chen, J Steinhardt - arXiv preprint arXiv:2412.08686, 2024 - arxiv.org
Interpretability methods seek to understand language model representations, yet the outputs
of most such methods--circuits, vectors, scalars--are not immediately human-interpretable. In …

Monet: Mixture of Monosemantic Experts for Transformers

J Park, YJ Ahn, KE Kim, J Kang - arXiv preprint arXiv:2412.04139, 2024 - arxiv.org
Understanding the internal computations of large language models (LLMs) is crucial for
aligning them with human values and preventing undesirable behaviors like toxic content …
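For reference, here is a generic sparse mixture-of-experts layer with top-1 routing; it shows the router/expert building block the title alludes to, not the Monet architecture itself, and all sizes are arbitrary.

```python
# Generic sparse MoE layer with top-1 routing (illustrative, not Monet).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model: int, n_experts: int, d_ff: int):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [tokens, d_model]; route each token to its single best expert.
        probs = F.softmax(self.router(x), dim=-1)
        top_p, top_idx = probs.max(dim=-1)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e
            if mask.any():
                out[mask] = top_p[mask, None] * expert(x[mask])
        return out

layer = MoELayer(d_model=64, n_experts=8, d_ff=128)
print(layer(torch.randn(10, 64)).shape)  # torch.Size([10, 64])
```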

How do Llamas process multilingual text? A latent exploration through activation patching

C Dumas, V Veselovsky, G Monea, R West… - ICML 2024 Workshop …, 2024 - openreview.net
A central question in multilingual language modeling is whether large language models
(LLMs) develop a universal concept representation, disentangled from specific languages …

Controllable Context Sensitivity and the Knob Behind It

J Minder, K Du, N Stoehr, G Monea, C Wendler… - arXiv preprint arXiv …, 2024 - arxiv.org
When making predictions, a language model must trade off how much it relies on its context
vs. its prior knowledge. Choosing how sensitive the model is to its context is a fundamental …
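A hedged sketch of what a one-dimensional control "knob" over activations can look like: a direction estimated as the mean difference between two contrasting sets of activations, added with a signed strength. The toy tensors and the turn_knob helper are hypothetical and not taken from the paper.

```python
# Toy one-dimensional "knob" via a mean-difference direction (assumed setup).
import torch

torch.manual_seed(0)
d_model = 64
acts_context = torch.randn(100, d_model) + 1.0   # e.g. runs where the model follows its context
acts_prior = torch.randn(100, d_model) - 1.0     # e.g. runs where it follows prior knowledge

direction = acts_context.mean(0) - acts_prior.mean(0)
direction = direction / direction.norm()

def turn_knob(hidden: torch.Tensor, strength: float) -> torch.Tensor:
    # Positive strength pushes toward context-following, negative toward the prior.
    return hidden + strength * direction

h = torch.randn(1, d_model)
print((turn_knob(h, 4.0) @ direction).item(), (turn_knob(h, -4.0) @ direction).item())
```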

Unveiling LLM Mechanisms Through Neural ODEs and Control Theory

Y Zhang - arXiv preprint arXiv:2406.16985, 2024 - arxiv.org
This study presents a novel approach that leverages Neural Ordinary Differential Equations
(Neural ODEs) to unravel the intricate relationships between inputs and outputs in Large …
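To make the Neural ODE ingredient concrete, the sketch below parameterizes dh/dt with a small MLP and integrates it with fixed-step Euler updates; it is an illustrative toy, not the control-theoretic analysis of the cited study.

```python
# Minimal Neural ODE sketch with a fixed-step Euler integrator (illustrative only).
import torch
import torch.nn as nn

class ODEFunc(nn.Module):
    """Parameterizes dh/dt = f(h, t) with a small MLP."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.Tanh(), nn.Linear(64, dim))

    def forward(self, h: torch.Tensor, t: float) -> torch.Tensor:
        return self.net(h)

def odeint_euler(func: ODEFunc, h0: torch.Tensor, t0=0.0, t1=1.0, steps=20):
    # Integrate the hidden state from t0 to t1 with explicit Euler steps.
    h, dt = h0, (t1 - t0) / steps
    for i in range(steps):
        h = h + dt * func(h, t0 + i * dt)
    return h

func = ODEFunc(dim=8)
h0 = torch.randn(4, 8)           # initial hidden states (e.g. embedded inputs)
h1 = odeint_euler(func, h0)      # final states, differentiable end to end
print(h1.shape)                  # torch.Size([4, 8])
```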