Who's asking? User personas and the mechanics of latent misalignment

A Ghandeharioun, A Yuan, M Guerard… - Advances in …, 2025 - proceedings.neurips.cc
Studies show that safety-tuned models may nevertheless divulge harmful information. In this
work, we show that whether they do so depends significantly on who they are talking to …

Unpacking SDXL Turbo: Interpreting text-to-image models with sparse autoencoders

V Surkov, C Wendler, M Terekhov… - arXiv preprint arXiv …, 2024 - arxiv.org
Sparse autoencoders (SAEs) have become a core ingredient in the reverse engineering of
large language models (LLMs). For LLMs, they have been shown to decompose …
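For orientation, below is a minimal sparse-autoencoder sketch of the kind this line of work builds on: a linear encoder with a ReLU and an L1 penalty that decomposes model activations into sparse features. The dimensions and the penalty coefficient are illustrative assumptions, not the paper's configuration.

```python
# Minimal SAE sketch (illustrative only, not the paper's implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)   # activations -> sparse features
        self.decoder = nn.Linear(d_hidden, d_model)   # sparse features -> reconstruction

    def forward(self, x: torch.Tensor):
        f = F.relu(self.encoder(x))      # non-negative feature activations
        x_hat = self.decoder(f)          # reconstruct the original activation
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that encourages sparse features.
    return F.mse_loss(x_hat, x) + l1_coeff * f.abs().mean()

# Usage: fit on a batch of cached activations (shape [batch, d_model]).
sae = SparseAutoencoder(d_model=768, d_hidden=768 * 8)
x = torch.randn(32, 768)                 # placeholder activations
x_hat, f = sae(x)
loss = sae_loss(x, x_hat, f)
loss.backward()
```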

Open Problems in Mechanistic Interpretability

L Sharkey, B Chughtai, J Batson, J Lindsey… - arXiv preprint arXiv …, 2025 - arxiv.org
Mechanistic interpretability aims to understand the computational mechanisms underlying
neural networks' capabilities in order to accomplish concrete scientific and engineering …

Separating Tongue from Thought: Activation Patching Reveals Language-Agnostic Concept Representations in Transformers

C Dumas, C Wendler, V Veselovsky, G Monea… - arXiv preprint arXiv …, 2024 - arxiv.org
A central question in multilingual language modeling is whether large language models
(LLMs) develop a universal concept representation, disentangled from specific languages …
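As a rough illustration of activation patching, the hedged sketch below caches one transformer block's output on a source prompt and splices it into a run on a target prompt via forward hooks. The model (gpt2), layer index, and prompts are placeholder assumptions and do not reproduce the paper's setup.

```python
# Generic activation-patching sketch with forward hooks (assumed setup).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
LAYER = 6  # which transformer block's output to patch (assumption)

cache = {}

def _hidden(output):
    # GPT-2 blocks may return a tuple; hidden states are the first element.
    return output[0] if isinstance(output, tuple) else output

def save_hook(module, inputs, output):
    cache["h"] = _hidden(output).detach()

def patch_hook(module, inputs, output):
    hidden = _hidden(output).clone()
    # Overwrite the block output at the last token position with the cached one.
    hidden[:, -1, :] = cache["h"][:, -1, :]
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

block = model.transformer.h[LAYER]

# 1) Run the "source" prompt and cache the activation.
handle = block.register_forward_hook(save_hook)
with torch.no_grad():
    model(**tok("The capital of France is", return_tensors="pt"))
handle.remove()

# 2) Run the "target" prompt with the cached activation patched in.
handle = block.register_forward_hook(patch_hook)
with torch.no_grad():
    logits = model(**tok("The capital of Italy is", return_tensors="pt")).logits
handle.remove()

print(tok.decode(logits[0, -1].argmax().item()))  # inspect how the patch shifts the prediction
```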

Self-Control of LLM Behaviors by Compressing Suffix Gradient into Prefix Controller

M Cai, Y Zhang, S Zhang, F Yin, D Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
We propose SelfControl, an inference-time model control method utilizing gradients to
control the behavior of large language models (LLMs) without explicit human annotations …
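The sketch below illustrates only the generic idea of gradient-based activation steering (take the gradient of a behavior score with respect to a hidden state and step along it); the scorer, model head, and steering strength are toy placeholders, and this is not the SelfControl prefix-controller method itself.

```python
# Toy sketch of gradient-based activation steering (not the SelfControl method).
import torch
import torch.nn as nn

torch.manual_seed(0)
d = 16
model_head = nn.Linear(d, 5)        # toy stand-in for a language-model head
behavior_scorer = nn.Linear(d, 1)   # toy proxy for a desired-behavior score

h = torch.randn(1, d, requires_grad=True)   # hidden state at some layer

# Score the current hidden state, then take a gradient step on the state itself.
score = behavior_scorer(h).sum()
(grad,) = torch.autograd.grad(score, h)
alpha = 0.5                          # steering strength (assumption)
h_steered = h + alpha * grad

# Compare next-token logits before and after steering.
print(model_head(h).detach())
print(model_head(h_steered).detach())
```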

LatentQA: Teaching LLMs to Decode Activations Into Natural Language

A Pan, L Chen, J Steinhardt - arXiv preprint arXiv:2412.08686, 2024 - arxiv.org
Interpretability methods seek to understand language model representations, yet the outputs
of most such methods--circuits, vectors, scalars--are not immediately human-interpretable. In …

Monet: Mixture of Monosemantic Experts for Transformers

J Park, YJ Ahn, KE Kim, J Kang - arXiv preprint arXiv:2412.04139, 2024 - arxiv.org
Understanding the internal computations of large language models (LLMs) is crucial for
aligning them with human values and preventing undesirable behaviors like toxic content …
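For reference, here is a generic sparse mixture-of-experts layer with top-1 routing; it shows the router/expert building block the title alludes to, not the Monet architecture itself, and all sizes are arbitrary.

```python
# Generic sparse MoE layer with top-1 routing (illustrative, not Monet).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model: int, n_experts: int, d_ff: int):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [tokens, d_model]; route each token to its single best expert.
        probs = F.softmax(self.router(x), dim=-1)
        top_p, top_idx = probs.max(dim=-1)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e
            if mask.any():
                out[mask] = top_p[mask, None] * expert(x[mask])
        return out

layer = MoELayer(d_model=64, n_experts=8, d_ff=128)
print(layer(torch.randn(10, 64)).shape)  # torch.Size([10, 64])
```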

How do Llamas process multilingual text? A latent exploration through activation patching

C Dumas, V Veselovsky, G Monea, R West… - ICML 2024 Workshop …, 2024 - openreview.net
A central question in multilingual language modeling is whether large language models
(LLMs) develop a universal concept representation, disentangled from specific languages …

Controllable Context Sensitivity and the Knob Behind It

J Minder, K Du, N Stoehr, G Monea, C Wendler… - arXiv preprint arXiv …, 2024 - arxiv.org
When making predictions, a language model must trade off how much it relies on its context
vs. its prior knowledge. Choosing how sensitive the model is to its context is a fundamental …
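A hedged sketch of what a one-dimensional control "knob" over activations can look like: a direction estimated as the mean difference between two contrasting sets of activations, added with a signed strength. The toy tensors and the turn_knob helper are hypothetical and not taken from the paper.

```python
# Toy one-dimensional "knob" via a mean-difference direction (assumed setup).
import torch

torch.manual_seed(0)
d_model = 64
acts_context = torch.randn(100, d_model) + 1.0   # e.g. runs where the model follows its context
acts_prior = torch.randn(100, d_model) - 1.0     # e.g. runs where it follows prior knowledge

direction = acts_context.mean(0) - acts_prior.mean(0)
direction = direction / direction.norm()

def turn_knob(hidden: torch.Tensor, strength: float) -> torch.Tensor:
    # Positive strength pushes toward context-following, negative toward the prior.
    return hidden + strength * direction

h = torch.randn(1, d_model)
print((turn_knob(h, 4.0) @ direction).item(), (turn_knob(h, -4.0) @ direction).item())
```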

Unveiling LLM Mechanisms Through Neural ODEs and Control Theory

Y Zhang - arXiv preprint arXiv:2406.16985, 2024 - arxiv.org
This study presents a novel approach that leverages Neural Ordinary Differential Equations
(Neural ODEs) to unravel the intricate relationships between inputs and outputs in Large …
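To make the Neural ODE ingredient concrete, the sketch below parameterizes dh/dt with a small MLP and integrates it with fixed-step Euler updates; it is an illustrative toy, not the control-theoretic analysis of the cited study.

```python
# Minimal Neural ODE sketch with a fixed-step Euler integrator (illustrative only).
import torch
import torch.nn as nn

class ODEFunc(nn.Module):
    """Parameterizes dh/dt = f(h, t) with a small MLP."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.Tanh(), nn.Linear(64, dim))

    def forward(self, h: torch.Tensor, t: float) -> torch.Tensor:
        return self.net(h)

def odeint_euler(func: ODEFunc, h0: torch.Tensor, t0=0.0, t1=1.0, steps=20):
    # Integrate the hidden state from t0 to t1 with explicit Euler steps.
    h, dt = h0, (t1 - t0) / steps
    for i in range(steps):
        h = h + dt * func(h, t0 + i * dt)
    return h

func = ODEFunc(dim=8)
h0 = torch.randn(4, 8)           # initial hidden states (e.g. embedded inputs)
h1 = odeint_euler(func, h0)      # final states, differentiable end to end
print(h1.shape)                  # torch.Size([4, 8])
```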