Harmful fine-tuning attacks and defenses for large language models: A survey
Recent research demonstrates that the nascent fine-tuning-as-a-service business model
exposes serious safety concerns: fine-tuning on a small amount of harmful data uploaded by users …
Lazy safety alignment for large language models against harmful fine-tuning
Recent studies show that Large Language Models (LLMs) with safety alignment can be
jailbroken by fine-tuning on a dataset mixed with harmful data. For the first time in the literature, we …
An Overview of Trustworthy AI: Advances in IP Protection, Privacy-preserving Federated Learning, Security Verification, and GAI Safety Alignment
AI has undergone a remarkable evolution marked by groundbreaking milestones.
Like any powerful tool, it can be turned into a weapon for devastation in the wrong hands …
Targeted vaccine: Safety alignment for large language models against harmful fine-tuning via layer-wise perturbation
The harmful fine-tuning attack poses a serious threat to online fine-tuning services. Vaccine, a
recent alignment-stage defense, applies uniform perturbation to the embeddings of all layers to …
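To make the contrast between Vaccine's uniform perturbation and Targeted Vaccine's layer-wise variant concrete, here is a minimal PyTorch sketch of one perturbation-aware alignment step, assuming a Hugging Face causal LM with a LLaMA-style `model.model.layers` module tree; `rho` and the layer selection are illustrative, not the papers' exact recipe.

```python
import torch

def vaccine_style_step(model, input_ids, labels, optimizer, rho=0.1, layer_ids=None):
    """One perturbation-aware alignment step: train against worst-case
    perturbations of per-layer hidden states. A sketch of the Vaccine idea;
    `rho`, the layer selection, and the LLaMA-style `model.model.layers`
    module tree are illustrative assumptions."""
    layers = model.model.layers
    if layer_ids is None:                       # uniform scheme: perturb every layer
        layer_ids = range(len(layers))          # targeted variant: pass a subset
    captured, perturb, handles = {}, {}, []

    def make_hook(i):
        def hook(module, args, output):
            hidden = output[0] if isinstance(output, tuple) else output
            if i in perturb:                    # pass 2: inject the perturbation
                hidden = hidden + perturb[i]
            else:                               # pass 1: remember the activation
                hidden.retain_grad()
                captured[i] = hidden
            return (hidden,) + tuple(output[1:]) if isinstance(output, tuple) else hidden
        return hook

    for i in layer_ids:
        handles.append(layers[i].register_forward_hook(make_hook(i)))

    # Pass 1: gradient of the alignment loss w.r.t. each layer's hidden states.
    model(input_ids=input_ids, labels=labels).loss.backward()
    for i, h in captured.items():
        perturb[i] = (rho * h.grad / (h.grad.norm() + 1e-12)).detach()
    model.zero_grad()

    # Pass 2: minimize the loss under the adversarial perturbations.
    loss = model(input_ids=input_ids, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    for h in handles:
        h.remove()
    return loss.item()
```

Passing a subset of layer indices approximates the targeted variant; the default perturbs every layer, as in the uniform scheme.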
Programming refusal with conditional activation steering
LLMs have shown remarkable capabilities, but precisely controlling their response behavior
remains challenging. Existing activation steering methods alter LLM behavior …
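A minimal sketch of the conditional idea: a steering vector is added to a layer's hidden states only when a condition direction fires on the current activations. The layer choice, `alpha`, `threshold`, and the cosine condition are assumptions for illustration, not the paper's exact method.

```python
import torch
import torch.nn.functional as F

def attach_conditional_steering(layer, steer_vec, cond_vec, threshold=0.2, alpha=4.0):
    """Add `alpha * steer_vec` to a layer's hidden states only when the last
    token's activation points in the direction `cond_vec` (a sketch of
    conditional activation steering; layer choice, threshold, and alpha are
    illustrative assumptions)."""
    cond_vec = cond_vec / cond_vec.norm()

    def hook(module, args, output):
        hidden = output[0] if isinstance(output, tuple) else output
        sim = F.cosine_similarity(hidden[:, -1, :], cond_vec.unsqueeze(0), dim=-1)
        if bool((sim > threshold).any()):       # condition fires: steer the activations
            hidden = hidden + alpha * steer_vec.to(hidden.dtype)
        return (hidden,) + tuple(output[1:]) if isinstance(output, tuple) else hidden

    return layer.register_forward_hook(hook)

# Usage: handle = attach_conditional_steering(model.model.layers[12], v_refuse, v_harm)
# ... generate as usual; call handle.remove() to detach the steering hook.
```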
Gradient routing: Masking gradients to localize computation in neural networks
Neural networks are trained primarily based on their inputs and outputs, without regard for
their internal mechanisms. These neglected mechanisms determine properties that are …
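A batch-level simplification of the gradient-routing idea in PyTorch: after backpropagation, gradients are masked so that one kind of data updates only a designated parameter region. The region definition and batch-level (rather than per-example) masking are illustrative assumptions.

```python
import torch

def route_gradients(model, region, batch_is_harmful):
    """After loss.backward(), zero gradients so harmful batches update only
    the designated parameter region and benign batches update everything
    else (a batch-level simplification of gradient routing; which parameter
    names form `region` is an illustrative assumption)."""
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        in_region = name in region
        if batch_is_harmful != in_region:   # gradient not routed to this parameter: mask it
            p.grad.zero_()

# Usage: loss.backward(); route_gradients(model, {"lm_head.weight"}, harmful); optimizer.step()
```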
Seal: Safety-enhanced aligned LLM fine-tuning via bilevel data selection
Fine-tuning on task-specific data to boost downstream performance is a crucial step for
leveraging Large Language Models (LLMs). However, previous studies have demonstrated …
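Seal formulates data selection as a bilevel problem; the one-shot sketch below captures only the flavor, ranking candidate fine-tuning batches by gradient alignment with a trusted safe set. The `loss_fn` interface, `keep_ratio`, and the cosine criterion are assumptions, not the paper's solver.

```python
import torch
import torch.nn.functional as F

def select_safe_batches(model, candidates, safe_batch, loss_fn, keep_ratio=0.8):
    """Rank candidate fine-tuning batches by how well their gradient aligns
    with the gradient on a trusted safe set, and keep the top fraction (a
    one-shot simplification of Seal's bilevel selection)."""
    params = [p for p in model.parameters() if p.requires_grad]

    def flat_grad(batch):
        # loss_fn(model, batch) is a hypothetical interface returning a scalar loss.
        grads = torch.autograd.grad(loss_fn(model, batch), params, allow_unused=True)
        return torch.cat([g.reshape(-1) for g in grads if g is not None])

    safe_g = flat_grad(safe_batch)
    scores = [F.cosine_similarity(flat_grad(b), safe_g, dim=0).item() for b in candidates]
    k = max(1, int(keep_ratio * len(candidates)))
    keep = sorted(range(len(candidates)), key=scores.__getitem__, reverse=True)[:k]
    return [candidates[i] for i in keep]
```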
Vaccine: Perturbation-aware alignment for large language models against harmful fine-tuning attack
The new paradigm of fine-tuning-as-a-service introduces a new attack surface for Large
Language Models (LLMs): a small amount of harmful data uploaded by users can easily trick the fine …
Separate the Wheat from the Chaff: A Post-Hoc Approach to Safety Re-Alignment for Fine-Tuned Language Models
D. Wu, X. Lu, Y. Zhao, B. Qin. arXiv preprint arXiv:2412.11041, 2024.
Although large language models (LLMs) achieve effective safety alignment at the time of
release, they still face various safety challenges. A key issue is that fine-tuning often …
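A hedged sketch of the post-hoc idea: after fine-tuning, the weights that drifted furthest from the safety-aligned checkpoint are restored from it. Using drift magnitude as a proxy for safety-critical weights is an illustrative assumption; the paper's actual identification procedure may differ.

```python
import torch

def restore_drifted_weights(finetuned, aligned, fraction=0.05):
    """Copy the aligned model's values back into the fine-tuned model for
    the parameters that drifted most during fine-tuning (a sketch of
    post-hoc safety re-alignment; the top-drift criterion and `fraction`
    are illustrative assumptions)."""
    with torch.no_grad():
        for (name, p_ft), (_, p_al) in zip(finetuned.named_parameters(),
                                           aligned.named_parameters()):
            drift = (p_ft - p_al).abs().flatten()
            k = max(1, int(fraction * drift.numel()))
            # Threshold at the k-th largest per-weight drift magnitude.
            thresh = drift.kthvalue(drift.numel() - k + 1).values
            mask = (p_ft - p_al).abs() >= thresh
            p_ft.data[mask] = p_al.data[mask]
```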
SaLoRA: Safety-Alignment Preserved Low-Rank Adaptation
As advancements in large language models (LLMs) continue and the demand for
personalized models increases, parameter-efficient fine-tuning (PEFT) methods (e.g., LoRA) …
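For reference, a standard LoRA adapter in PyTorch: the aligned base weights stay frozen and only a low-rank update is trained. SaLoRA's additional safety-preserving projection is not reproduced here; `r` and `alpha` are illustrative defaults.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen base linear layer plus a trainable low-rank update,
    y = W x + (alpha / r) * B (A x), i.e. standard LoRA."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)              # keep the aligned weights frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no drift at step 0
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T
```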