Virus: Harmful Fine-tuning Attack for Large Language Models Bypassing Guardrail Moderation

T Huang, S Hu, F Ilhan, SF Tekin, L Liu - arXiv preprint arXiv:2501.17433, 2025 - arxiv.org
Recent research shows that Large Language Models (LLMs) are vulnerable to harmful fine-tuning attacks--models lose their safety alignment ability after fine-tuning on a few harmful …