Harmful fine-tuning attacks and defenses for large language models: A survey
Recent research demonstrates that the nascent fine-tuning-as-a-service business model exposes serious safety concerns: fine-tuning on a few harmful data points uploaded by users …
Lazy safety alignment for large language models against harmful fine-tuning
Recent studies show that Large Language Models (LLMs) with safety alignment can be jailbroken by fine-tuning on a dataset mixed with harmful data. For the first time in the literature, we …
Booster: Tackling harmful fine-tuning for large language models via attenuating harmful perturbation
The harmful fine-tuning issue (Qi et al., 2023) poses serious safety concerns for large language models' fine-tuning-as-a-service. While existing defenses …
Targeted vaccine: Safety alignment for large language models against harmful fine-tuning via layer-wise perturbation
The harmful fine-tuning attack poses a serious threat to online fine-tuning services. Vaccine, a recent alignment-stage defense, applies uniform perturbation to the embeddings of all layers to …
Locking down the finetuned LLMs safety
Fine-tuning large language models (LLMs) on additional datasets is often necessary to
optimize them for specific downstream tasks. However, existing safety alignment measures …
Vaccine: Perturbation-aware alignment for large language models against harmful fine-tuning attack
The new paradigm of fine-tuning-as-a-service introduces a new attack surface for Large Language Models (LLMs): a few harmful data points uploaded by users can easily trick the fine …
Lisa: Lazy safety alignment for large language models against harmful fine-tuning attack
Recent studies show that Large Language Models (LLMs) with safety alignment can be jailbroken by fine-tuning on a dataset mixed with harmful data. For the first time in the literature …
Virus: Harmful Fine-tuning Attack for Large Language Models Bypassing Guardrail Moderation
Recent research shows that Large Language Models (LLMs) are vulnerable to harmful fine-tuning attacks: models lose their safety alignment after fine-tuning on a few harmful …
JBShield: Defending Large Language Models from Jailbreak Attacks through Activated Concept Analysis and Manipulation
Despite the implementation of safety alignment strategies, large language models (LLMs)
remain vulnerable to jailbreak attacks, which undermine these safety guardrails and pose …
Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities
Evaluations of large language model (LLM) risks and capabilities are increasingly being
incorporated into AI risk management and governance frameworks. Currently, most risk …