Harmful fine-tuning attacks and defenses for large language models: A survey

T Huang, S Hu, F Ilhan, SF Tekin, L Liu - arXiv preprint arXiv:2409.18169, 2024 - arxiv.org
Recent research demonstrates that the nascent fine-tuning-as-a-service business model
exposes serious safety concerns--fine-tuning on a few harmful data points uploaded by users …

Lazy safety alignment for large language models against harmful fine-tuning

T Huang, S Hu, F Ilhan, SF Tekin… - arXiv preprint arXiv …, 2024 - openreview.net
Recent studies show that Large Language Models (LLMs) with safety alignment can be jail-
broken by fine-tuning on a dataset mixed with harmful data. For the first time in the literature, we …

Booster: Tackling harmful fine-tuning for large language models via attenuating harmful perturbation

T Huang, S Hu, F Ilhan, SF Tekin, L Liu - arXiv preprint arXiv:2409.01586, 2024 - arxiv.org
The harmful fine-tuning issue (Qi et al., 2023) poses serious safety concerns for large
language models' fine-tuning-as-a-service. While existing defenses …

Targeted vaccine: Safety alignment for large language models against harmful fine-tuning via layer-wise perturbation

G Liu, W Lin, T Huang, R Mo, Q Mu, L Shen - arXiv preprint arXiv …, 2024 - arxiv.org
The harmful fine-tuning attack poses a serious threat to online fine-tuning services. Vaccine, a
recent alignment-stage defense, applies uniform perturbation to all layers of embedding to …

Locking down the finetuned LLMs safety

M Zhu, L Yang, Y Wei, N Zhang, Y Zhang - arXiv preprint arXiv …, 2024 - arxiv.org
Fine-tuning large language models (LLMs) on additional datasets is often necessary to
optimize them for specific downstream tasks. However, existing safety alignment measures …

Vaccine: Perturbation-aware alignment for large language models against harmful fine-tuning attack

T Huang, S Hu, L Liu - The Thirty-eighth Annual Conference on …, 2024 - openreview.net
The new paradigm of fine-tuning-as-a-service introduces a new attack surface for Large
Language Models (LLMs): a few harmful data points uploaded by users can easily trick the fine …

Lisa: Lazy safety alignment for large language models against harmful fine-tuning attack

T Huang, S Hu, F Ilhan, SF Tekin… - The Thirty-eighth Annual …, 2024 - openreview.net
Recent studies show that Large Language Models (LLMs) with safety alignment can be jail-
broken by fine-tuning on a dataset mixed with harmful data. For the first time in the literature …

Virus: Harmful Fine-tuning Attack for Large Language Models Bypassing Guardrail Moderation

T Huang, S Hu, F Ilhan, SF Tekin, L Liu - arXiv preprint arXiv:2501.17433, 2025 - arxiv.org
Recent research shows that Large Language Models (LLMs) are vulnerable to harmful fine-
tuning attacks--models lose their safety alignment ability after fine-tuning on a few harmful …

JBShield: Defending Large Language Models from Jailbreak Attacks through Activated Concept Analysis and Manipulation

S Zhang, Y Zhai, K Guo, H Hu, S Guo, Z Fang… - arXiv preprint arXiv …, 2025 - arxiv.org
Despite the implementation of safety alignment strategies, large language models (LLMs)
remain vulnerable to jailbreak attacks, which undermine these safety guardrails and pose …

Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities

Z Che, S Casper, R Kirk, A Satheesh, S Slocum… - arXiv preprint arXiv …, 2025 - arxiv.org
Evaluations of large language model (LLM) risks and capabilities are increasingly being
incorporated into AI risk management and governance frameworks. Currently, most risk …