Open problems in machine unlearning for AI safety

F Barez, T Fu, A Prabhu, S Casper, A Sanyal… - arXiv preprint arXiv …, 2025 - arxiv.org
As AI systems become more capable, widely deployed, and increasingly autonomous in
critical areas such as cybersecurity, biological research, and healthcare, ensuring their …

Steering language model refusal with sparse autoencoders

K O'Brien, D Majercak, X Fernandes, R Edgar… - arXiv preprint arXiv …, 2024 - arxiv.org
Responsible practices for deploying language models include guiding models to recognize
and refuse to answer prompts that are considered unsafe, while complying with safe …

Enhancing Automated Interpretability with Output-Centric Feature Descriptions

Y Gur-Arieh, R Mayan, C Agassy, A Geiger… - arXiv preprint arXiv …, 2025 - arxiv.org
Automated interpretability pipelines generate natural language descriptions for the concepts
represented by features in large language models (LLMs), such as plants or the first word in …

SAeUron: Interpretable Concept Unlearning in Diffusion Models with Sparse Autoencoders

B Cywiński, K Deja - arXiv preprint arXiv:2501.18052, 2025 - arxiv.org
Recent machine unlearning approaches offer a promising solution for removing unwanted
concepts from diffusion models. However, traditional methods, which largely rely on fine …