Mechanistic Interpretability for AI Safety--A Review

L Bereska, E Gavves - arXiv preprint arXiv:2404.14082, 2024 - arxiv.org
Understanding AI systems' inner workings is critical for ensuring value alignment and safety.
This review explores mechanistic interpretability: reverse engineering the computational …

Emergence of hidden capabilities: Exploring learning dynamics in concept space

CF Park, M Okawa, A Lee… - Advances in Neural …, 2025 - proceedings.neurips.cc
Modern generative models demonstrate impressive capabilities, likely stemming from an
ability to identify and manipulate abstract concepts underlying their training data. However …

A primer on the inner workings of transformer-based language models

J Ferrando, G Sarti, A Bisazza, M Costa-jussà - 2024 - research.rug.nl
The rapid progress of research aimed at interpreting the inner workings of advanced
language models has highlighted a need for contextualizing the insights gained from years …

Preference tuning for toxicity mitigation generalizes across languages

X Li, ZX Yong, SH Bach - arXiv preprint arXiv:2406.16235, 2024 - arxiv.org
Detoxifying multilingual Large Language Models (LLMs) has become crucial due to their
increasing global use. In this work, we explore zero-shot cross-lingual generalization of …

Automatic pseudo-harmful prompt generation for evaluating false refusals in large language models

B An, S Zhu, R Zhang, MA Panaitescu-Liess… - arXiv preprint arXiv …, 2024 - arxiv.org
Safety-aligned large language models (LLMs) sometimes falsely refuse pseudo-harmful
prompts, like "how to kill a mosquito," which are actually harmless. Frequent false refusals …

Latent adversarial training improves robustness to persistent harmful behaviors in LLMs

A Sheshadri, A Ewart, P Guo, A Lynch, C Wu… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) can often be made to behave in undesirable ways that they
are explicitly fine-tuned not to. For example, the LLM red-teaming literature has produced a …

What makes and breaks safety fine-tuning? A mechanistic study

S Jain, ES Lubana, K Oksuz, T Joy, PHS Torr… - arXiv preprint arXiv …, 2024 - arxiv.org
Safety fine-tuning helps align Large Language Models (LLMs) with human preferences for
their safe deployment. To better understand the underlying factors that make models safe via …

On evaluating the durability of safeguards for open-weight LLMs

X Qi, B Wei, N Carlini, Y Huang, T Xie, L He… - arXiv preprint arXiv …, 2024 - arxiv.org
Stakeholders--from model developers to policymakers--seek to minimize the dual-use risks
of large language models (LLMs). An open challenge to this goal is whether technical …

Robust LLM safeguarding via refusal feature adversarial training

L Yu, V Do, K Hambardzumyan… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) are vulnerable to adversarial attacks that can elicit harmful
responses. Defending against such attacks remains challenging due to the opacity of …

Understanding jailbreak success: A study of latent space dynamics in large language models

S Ball, F Kreuter, N Panickssery - arXiv preprint arXiv:2406.09289, 2024 - arxiv.org
Conversational large language models are trained to refuse to answer harmful questions.
However, emergent jailbreaking techniques can still elicit unsafe outputs, presenting an …