Fine-tuning aligned language models compromises safety, even when users do not intend to!

X Qi, Y Zeng, T Xie, PY Chen, R Jia, P Mittal… - arXiv preprint arXiv …, 2023 - arxiv.org
Optimizing large language models (LLMs) for downstream use cases often involves the
customization of pre-trained LLMs through further fine-tuning. Meta's open release of Llama …

Foundational challenges in assuring alignment and safety of large language models

U Anwar, A Saparov, J Rando, D Paleka… - arXiv preprint arXiv …, 2024 - arxiv.org
This work identifies 18 foundational challenges in assuring the alignment and safety of large
language models (LLMs). These challenges are organized into three different categories …

Refusal in language models is mediated by a single direction

A Arditi, O Obeso, A Syed, D Paleka… - arXiv preprint arXiv …, 2024 - arxiv.org
Conversational large language models are fine-tuned for both instruction-following and
safety, resulting in models that obey benign requests but refuse harmful ones. While this …

Muse: Machine unlearning six-way evaluation for language models

W Shi, J Lee, Y Huang, S Malladi, J Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
Language models (LMs) are trained on vast amounts of text data, which may include private
and copyrighted content. Data owners may request the removal of their data from a trained …

Defending against unforeseen failure modes with latent adversarial training

S Casper, L Schulze, O Patel… - arXiv preprint arXiv …, 2024 - arxiv.org
Despite extensive diagnostics and debugging by developers, AI systems sometimes exhibit
harmful unintended behaviors. Finding and fixing these is challenging because the attack …

An adversarial perspective on machine unlearning for ai safety

J Łucki, B Wei, Y Huang, P Henderson… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models are finetuned to refuse questions about hazardous knowledge, but
these protections can often be bypassed. Unlearning methods aim at completely removing …

Safety alignment should be made more than just a few tokens deep

X Qi, A Panda, K Lyu, X Ma, S Roy, A Beirami… - arXiv preprint arXiv …, 2024 - arxiv.org
The safety alignment of current Large Language Models (LLMs) is vulnerable. Relatively
simple attacks, or even benign fine-tuning, can jailbreak aligned models. We argue that …

Against The Achilles' Heel: A Survey on Red Teaming for Generative Models

L Lin, H Mu, Z Zhai, M Wang, Y Wang, R Wang… - Journal of Artificial …, 2025 - jair.org
Generative models are rapidly gaining popularity and being integrated into everyday
applications, raising concerns over their safe use as various vulnerabilities are exposed. In …

Evaluating copyright takedown methods for language models

B Wei, W Shi, Y Huang, NA Smith, C Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Language models (LMs) derive their capabilities from extensive training on diverse data,
including potentially copyrighted material. These models can memorize and generate …

Decoding compressed trust: Scrutinizing the trustworthiness of efficient llms under compression

J Hong, J Duan, C Zhang, Z Li, C Xie… - arXiv preprint arXiv …, 2024 - arxiv.org
Compressing high-capability Large Language Models (LLMs) has emerged as a favored
strategy for resource-efficient inference. While state-of-the-art (SoTA) compression methods …