Who's asking? User personas and the mechanics of latent misalignment
A Ghandeharioun, A Yuan, M Guerard… - Advances in …, 2025 - proceedings.neurips.cc
Studies show that safety-tuned models may nevertheless divulge harmful information. In this
work, we show that whether they do so depends significantly on who they are talking to …
Unpacking SDXL Turbo: Interpreting Text-to-Image Models with Sparse Autoencoders
Sparse autoencoders (SAEs) have become a core ingredient in the reverse engineering of
large-language models (LLMs). For LLMs, they have been shown to decompose …
Open Problems in Mechanistic Interpretability
Mechanistic interpretability aims to understand the computational mechanisms underlying
neural networks' capabilities in order to accomplish concrete scientific and engineering …
Separating Tongue from Thought: Activation Patching Reveals Language-Agnostic Concept Representations in Transformers
A central question in multilingual language modeling is whether large language models
(LLMs) develop a universal concept representation, disentangled from specific languages …
Self-Control of LLM Behaviors by Compressing Suffix Gradient into Prefix Controller
We propose SelfControl, an inference-time model control method utilizing gradients to
control the behavior of large language models (LLMs) without explicit human annotations …
LatentQA: Teaching LLMs to Decode Activations Into Natural Language
Interpretability methods seek to understand language model representations, yet the outputs
of most such methods (circuits, vectors, scalars) are not immediately human-interpretable. In …
Monet: Mixture of Monosemantic Experts for Transformers
Understanding the internal computations of large language models (LLMs) is crucial for
aligning them with human values and preventing undesirable behaviors like toxic content …
How Do Llamas Process Multilingual Text? A Latent Exploration through Activation Patching
A central question in multilingual language modeling is whether large language models
(LLMs) develop a universal concept representation, disentangled from specific languages …
Controllable Context Sensitivity and the Knob Behind It
When making predictions, a language model must trade off how much it relies on its context
vs. its prior knowledge. Choosing how sensitive the model is to its context is a fundamental …
Unveiling LLM Mechanisms Through Neural ODEs and Control Theory
Y Zhang - arXiv preprint arXiv:2406.16985, 2024 - arxiv.org
This study presents a novel approach that leverages Neural Ordinary Differential Equations
(Neural ODEs) to unravel the intricate relationships between inputs and outputs in Large …