Harmful fine-tuning attacks and defenses for large language models: A survey

T Huang, S Hu, F Ilhan, SF Tekin, L Liu - arXiv preprint arXiv:2409.18169, 2024 - arxiv.org
Recent research demonstrates that the nascent fine-tuning-as-a-service business model
exposes serious safety concerns: fine-tuning on even a few harmful data points uploaded by users …

Lazy safety alignment for large language models against harmful fine-tuning

T Huang, S Hu, F Ilhan, SF Tekin… - arXiv preprint arXiv …, 2024 - openreview.net
Recent studies show that Large Language Models (LLMs) with safety alignment can be
jailbroken by fine-tuning on a dataset mixed with harmful data. For the first time in the literature, we …

Booster: Tackling harmful fine-tuning for large language models via attenuating harmful perturbation

T Huang, S Hu, F Ilhan, SF Tekin, L Liu - arXiv preprint arXiv:2409.01586, 2024 - arxiv.org
The harmful fine-tuning issue [qi2023fine] poses serious safety concerns for large
language models' fine-tuning-as-a-service. While existing defenses …

Targeted vaccine: Safety alignment for large language models against harmful fine-tuning via layer-wise perturbation

G Liu, W Lin, T Huang, R Mo, Q Mu, L Shen - arXiv preprint arXiv …, 2024 - arxiv.org
The harmful fine-tuning attack poses a serious threat to online fine-tuning services. Vaccine, a
recent alignment-stage defense, applies uniform perturbation to all layers of the embedding to …

Backtracking improves generation safety

Y Zhang, J Chi, H Nguyen, K Upasani, DM Bikel… - arXiv preprint arXiv …, 2024 - arxiv.org
Text generation has a fundamental limitation almost by definition: there is no taking back
tokens that have been generated, even when they are clearly problematic. In the context of …

On evaluating the durability of safeguards for open-weight llms

X Qi, B Wei, N Carlini, Y Huang, T **e, L He… - arxiv preprint arxiv …, 2024 - arxiv.org
Stakeholders--from model developers to policymakers--seek to minimize the dual-use risks
of large language models (LLMs). An open challenge to this goal is whether technical …

Position: Llm unlearning benchmarks are weak measures of progress

P Thaker, S Hu, N Kale, Y Maurya, ZS Wu… - arXiv preprint arXiv …, 2024 - arxiv.org
Unlearning methods have the potential to improve the privacy and safety of large language
models (LLMs) by removing sensitive or harmful information post hoc. The LLM unlearning …

Meta-unlearning on diffusion models: Preventing relearning unlearned concepts

H Gao, T Pang, C Du, T Hu, Z Deng, M Lin - arXiv preprint arXiv …, 2024 - arxiv.org
With the rapid progress of diffusion-based content generation, significant efforts are being
made to unlearn harmful or copyrighted concepts from pretrained diffusion models (DMs) to …

OML: Open, monetizable, and loyal AI

Z Cheng, E Contente, B Finch, O Golev… - arXiv preprint arXiv …, 2024 - arxiv.org
Artificial Intelligence (AI) has steadily improved across a wide range of tasks. However, the
development and deployment of AI are almost entirely controlled by a few powerful …

A Closer Look at Machine Unlearning for Large Language Models

X Yuan, T Pang, C Du, K Chen, W Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) may memorize sensitive or copyrighted content, raising
privacy and legal concerns. Due to the high cost of retraining from scratch, researchers …