Rethinking machine unlearning for large language models

S Liu, Y Yao, J Jia, S Casper, N Baracaldo… - Nature Machine …, 2025 - nature.com
We explore machine unlearning in the domain of large language models (LLMs), referred to
as LLM unlearning. This initiative aims to eliminate undesirable data influence (for example …

Threats, attacks, and defenses in machine unlearning: A survey

Z Liu, H Ye, C Chen, Y Zheng… - IEEE Open Journal of the …, 2025 - ieeexplore.ieee.org
Machine Unlearning (MU) has recently gained considerable attention due to its potential to
achieve Safe AI by removing the influence of specific data from trained Machine Learning …

Rethinking LLM memorization through the lens of adversarial compression

A Schwarzschild, Z Feng, P Maini… - Advances in Neural …, 2025 - proceedings.neurips.cc
Large language models (LLMs) trained on web-scale datasets raise substantial concerns
regarding permissible data usage. One major question is whether these models "memorize" …

Large language model unlearning via embedding-corrupted prompts

C Liu, Y Wang, J Flanigan, Y Liu - Advances in Neural …, 2025 - proceedings.neurips.cc
Large language models (LLMs) have advanced to encompass extensive knowledge across
diverse domains. Yet controlling what a large language model should not know is important …

Negative preference optimization: From catastrophic collapse to effective unlearning

R Zhang, L Lin, Y Bai, S Mei - arXiv preprint arXiv:2404.05868, 2024 - arxiv.org
Large Language Models (LLMs) often memorize sensitive, private, or copyrighted data
during pre-training. LLM unlearning aims to eliminate the influence of undesirable data from …

MUSE: Machine unlearning six-way evaluation for language models

W Shi, J Lee, Y Huang, S Malladi, J Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
Language models (LMs) are trained on vast amounts of text data, which may include private
and copyrighted content. Data owners may request the removal of their data from a trained …

Stress-testing capability elicitation with password-locked models

R Greenblatt, F Roger… - Advances in Neural …, 2025 - proceedings.neurips.cc
To determine the safety of large language models (LLMs), AI developers must be able to
assess their dangerous capabilities. But simple prompting strategies often fail to elicit an …

What makes and breaks safety fine-tuning? A mechanistic study

S Jain, ES Lubana, K Oksuz, T Joy… - Advances in …, 2025 - proceedings.neurips.cc
Safety fine-tuning helps align Large Language Models (LLMs) with human preferences for
their safe deployment. To better understand the underlying factors that make models safe via …

Guardrail baselines for unlearning in LLMs

P Thaker, Y Maurya, S Hu, ZS Wu, V Smith - arXiv preprint arXiv …, 2024 - arxiv.org
Recent work has demonstrated that finetuning is a promising approach to 'unlearn' concepts
from large language models. However, finetuning can be expensive, as it requires both …

Tamper-resistant safeguards for open-weight LLMs

R Tamirisa, B Bharathi, L Phan, A Zhou, A Gatti… - arXiv preprint arXiv …, 2024 - arxiv.org
Rapid advances in the capabilities of large language models (LLMs) have raised
widespread concerns regarding their potential for malicious use. Open-weight LLMs present …