Rethinking machine unlearning for large language models

S Liu, Y Yao, J Jia, S Casper, N Baracaldo… - Nature Machine …, 2025 - nature.com
We explore machine unlearning in the domain of large language models (LLMs), referred to
as LLM unlearning. This initiative aims to eliminate undesirable data influence (for example …

A comprehensive study of knowledge editing for large language models

N Zhang, Y Yao, B Tian, P Wang, S Deng… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) have shown extraordinary capabilities in understanding
and generating text that closely mirrors human communication. However, a primary …

Foundational challenges in assuring alignment and safety of large language models

U Anwar, A Saparov, J Rando, D Paleka… - arXiv preprint arXiv …, 2024 - arxiv.org
This work identifies 18 foundational challenges in assuring the alignment and safety of large
language models (LLMs). These challenges are organized into three different categories …

From persona to personalization: A survey on role-playing language agents

J Chen, X Wang, R Xu, S Yuan, Y Zhang, W Shi… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent advancements in large language models (LLMs) have significantly boosted the rise
of Role-Playing Language Agents (RPLAs), i.e., specialized AI systems designed to simulate …

Defending against unforeseen failure modes with latent adversarial training

S Casper, L Schulze, O Patel… - arXiv preprint arXiv …, 2024 - arxiv.org
Despite extensive diagnostics and debugging by developers, AI systems sometimes exhibit
harmful unintended behaviors. Finding and fixing these is challenging because the attack …

A causal explainable guardrails for large language models

Z Chu, Y Wang, L Li, Z Wang, Z Qin, K Ren - Proceedings of the 2024 on …, 2024 - dl.acm.org
Large Language Models (LLMs) have shown impressive performance in natural language
tasks, but their outputs can exhibit undesirable attributes or biases. Existing methods for …

Against The Achilles' Heel: A Survey on Red Teaming for Generative Models

L Lin, H Mu, Z Zhai, M Wang, Y Wang, R Wang… - Journal of Artificial …, 2025 - jair.org
Generative models are rapidly gaining popularity and being integrated into everyday
applications, raising concerns over their safe use as various vulnerabilities are exposed. In …

Securing large language models: Threats, vulnerabilities and responsible practices

S Abdali, R Anarfi, CJ Barberan, J He - arXiv preprint arXiv:2403.12503, 2024 - arxiv.org
Large language models (LLMs) have significantly transformed the landscape of Natural
Language Processing (NLP). Their impact extends across a diverse spectrum of tasks …

On the vulnerability of safety alignment in open-access LLMs

J Yi, R Ye, Q Chen, B Zhu, S Chen, D Lian… - Findings of the …, 2024 - aclanthology.org
Large language models (LLMs) possess immense capabilities but are susceptible to
malicious exploitation. To mitigate the risk, safety alignment is employed to align LLMs with …

SOUL: Unlocking the power of second-order optimization for LLM unlearning

J Jia, Y Zhang, Y Zhang, J Liu, B Runwal… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) have highlighted the necessity of effective unlearning
mechanisms to comply with data regulations and ethical AI practices. LLM unlearning aims …