An adversarial perspective on machine unlearning for AI safety

J Łucki, B Wei, Y Huang, P Henderson… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models are fine-tuned to refuse questions about hazardous knowledge, but
these protections can often be bypassed. Unlearning methods aim at completely removing …

Open problems in machine unlearning for AI safety

F Barez, T Fu, A Prabhu, S Casper, A Sanyal… - arXiv preprint arXiv …, 2025 - arxiv.org
As AI systems become more capable, widely deployed, and increasingly autonomous in
critical areas such as cybersecurity, biological research, and healthcare, ensuring their …

Position: LLM unlearning benchmarks are weak measures of progress

P Thaker, S Hu, N Kale, Y Maurya, ZS Wu… - arXiv preprint arXiv …, 2024 - arxiv.org
Unlearning methods have the potential to improve the privacy and safety of large language
models (LLMs) by removing sensitive or harmful information post hoc. The LLM unlearning …

Soft Token Attacks Cannot Reliably Audit Unlearning in Large Language Models

H Chen, S Szyller, W Xu, N Himayat - arXiv preprint arXiv:2502.15836, 2025 - arxiv.org
Large language models (LLMs) have become increasingly popular. Their emergent
capabilities can be attributed to their massive training datasets. However, these datasets …

A General Framework to Enhance Fine-tuning-based LLM Unlearning

J Ren, Z Dai, X Tang, H Liu, J Zeng, Z Li… - arXiv preprint arXiv …, 2025 - arxiv.org
Unlearning has been proposed to remove copyrighted and privacy-sensitive data from
Large Language Models (LLMs). Existing approaches primarily rely on fine-tuning-based …

Rethinking the Reliability of Representation Engineering in Large Language Models

Z Deng, J Jiang, G Long, C Zhang - OpenReview - openreview.net
Inspired by cognitive neuroscience, representation engineering (RepE) seeks to connect the
neural activities within large language models (LLMs) to their behaviors, providing a …