Harmful fine-tuning attacks and defenses for large language models: A survey

T Huang, S Hu, F Ilhan, SF Tekin, L Liu - arXiv preprint arXiv:2409.18169, 2024 - arxiv.org
Recent research demonstrates that the nascent fine-tuning-as-a-service business model
exposes serious safety concerns: fine-tuning on a few harmful data points uploaded by users …

Lazy safety alignment for large language models against harmful fine-tuning

T Huang, S Hu, F Ilhan, SF Tekin… - arXiv preprint arXiv …, 2024 - openreview.net
Recent studies show that Large Language Models (LLMs) with safety alignment can be
jailbroken by fine-tuning on a dataset mixed with harmful data. For the first time in the literature, we …

An Overview of Trustworthy AI: Advances in IP Protection, Privacy-preserving Federated Learning, Security Verification, and GAI Safety Alignment

Y Zheng, CH Chang, SH Huang… - IEEE Journal on …, 2024 - ieeexplore.ieee.org
AI has undergone a remarkable evolutionary journey marked by groundbreaking milestones.
Like any powerful tool, it can be turned into a weapon of devastation in the wrong hands …

Targeted vaccine: Safety alignment for large language models against harmful fine-tuning via layer-wise perturbation

G Liu, W Lin, T Huang, R Mo, Q Mu, L Shen - arXiv preprint arXiv …, 2024 - arxiv.org
Harmful fine-tuning attacks pose a serious threat to online fine-tuning services. Vaccine, a
recent alignment-stage defense, applies uniform perturbation to the embeddings of all layers to …
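
The snippet contrasts uniform perturbation with a layer-wise variant. Below is a minimal PyTorch sketch of the layer-wise idea under stated assumptions: the toy model, the choice of target layers, and the Gaussian noise with magnitude `epsilon` are all illustrative placeholders, not the paper's actual layer-selection criterion or perturbation scheme.

```python
# Minimal sketch: during alignment training, perturb the hidden states of
# only selected layers instead of all of them. The toy model, target-layer
# choice, and Gaussian noise are illustrative assumptions.
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Sequential(
    nn.Linear(16, 32), nn.ReLU(),
    nn.Linear(32, 32), nn.ReLU(),
    nn.Linear(32, 2),
)

epsilon = 0.05          # perturbation magnitude (hypothetical value)
target_layers = [0, 2]  # indices of "safety-critical" layers to perturb

def perturb_hook(module, inputs, output):
    if not module.training:
        return output
    # Inject bounded random drift into this layer's hidden state only.
    return output + epsilon * torch.randn_like(output)

for idx in target_layers:
    model[idx].register_forward_hook(perturb_hook)

# One alignment-style training step on toy data.
x, y = torch.randn(8, 16), torch.randint(0, 2, (8,))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
optimizer.step()
print(f"alignment loss under layer-wise perturbation: {loss.item():.4f}")
```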

Programming refusal with conditional activation steering

BW Lee, I Padhi, KN Ramamurthy, E Miehling… - arXiv preprint arXiv …, 2024 - arxiv.org
LLMs have shown remarkable capabilities, but precisely controlling their response behavior
remains challenging. Existing activation steering methods alter LLM behavior …
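
A minimal sketch of conditional steering as the title suggests it: a steering vector is added to an activation only when a separate condition direction fires at that position. The vectors, threshold, and strength below are random placeholders; in the actual method they would be derived from model activations on contrastive prompts.

```python
# Minimal sketch of conditional activation steering: steer a hidden state
# only where a "condition" direction fires. All vectors and constants here
# are placeholder assumptions.
import torch

torch.manual_seed(0)
d_model = 64

condition_vec = torch.randn(d_model)
condition_vec /= condition_vec.norm()   # unit "condition" direction
steer_vec = torch.randn(d_model)        # direction that induces refusal
threshold = 0.0                         # hypothetical firing threshold
alpha = 4.0                             # steering strength

def conditionally_steer(hidden):
    """hidden: (seq_len, d_model). Steer only positions where the condition fires."""
    # Cosine similarity of each position's activation with the condition direction.
    score = (hidden @ condition_vec) / hidden.norm(dim=-1).clamp_min(1e-6)
    mask = (score > threshold).float().unsqueeze(-1)
    return hidden + alpha * mask * steer_vec

h = torch.randn(10, d_model)
h_steered = conditionally_steer(h)
print("positions steered:", int((h_steered != h).any(-1).sum()))
```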

Gradient routing: Masking gradients to localize computation in neural networks

A Cloud, J Goldman-Wetzler, E Wybitul, J Miller… - arXiv preprint arXiv …, 2024 - arxiv.org
Neural networks are trained primarily based on their inputs and outputs, without regard for
their internal mechanisms. These neglected mechanisms determine properties that are …
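
A minimal sketch of the gradient-masking idea in the title: the forward pass is unchanged, but a stop-gradient mask lets each data subset's gradient flow only through a designated slice of hidden units, localizing what that subset can train. The half-and-half split of units between two routes is an illustrative assumption.

```python
# Minimal sketch of gradient routing: identical forward values, but each
# data route's backward gradient is zeroed outside its designated slice
# of hidden units. The unit split is illustrative.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

def routed_forward(x, route):
    h = model[1](model[0](x))
    # Route 0 may only train units 0..7; route 1 may only train units 8..15.
    mask = torch.zeros_like(h)
    if route == 0:
        mask[:, :8] = 1.0
    else:
        mask[:, 8:] = 1.0
    # Same forward value; gradient is blocked outside the routed slice.
    h = h * mask + (h * (1 - mask)).detach()
    return model[2](h)

for route in (0, 1):
    x, y = torch.randn(4, 8), torch.randint(0, 2, (4,))
    loss = nn.functional.cross_entropy(routed_forward(x, route), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(f"route {route} loss: {loss.item():.4f}")
```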

Seal: Safety-enhanced aligned LLM fine-tuning via bilevel data selection

H Shen, PY Chen, P Das, T Chen - arXiv preprint arXiv:2410.07471, 2024 - arxiv.org
Fine-tuning on task-specific data to boost downstream performance is a crucial step for
leveraging Large Language Models (LLMs). However, previous studies have demonstrated …

Vaccine: Perturbation-aware alignment for large language models against harmful fine-tuning attack

T Huang, S Hu, L Liu - The Thirty-eighth Annual Conference on …, 2024 - openreview.net
The new paradigm of fine-tuning-as-a-service introduces a new attack surface for Large
Language Models (LLMs): a few harmful data points uploaded by users can easily trick the fine …
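
The described defense trains against worst-case embedding drift. Below is a minimal sketch, read as sharpness-aware training in embedding space: one gradient-ascent step finds an adversarial embedding perturbation of norm `rho`, then the outer step updates the model on the perturbed embeddings. Model, data, and `rho` are toy placeholders, not the paper's exact algorithm.

```python
# Minimal sketch of perturbation-aware alignment: an inner gradient-ascent
# step finds a worst-case embedding perturbation, and the outer step trains
# the model under that perturbation. All components are toy placeholders.
import torch
import torch.nn as nn

torch.manual_seed(0)
embed = nn.Embedding(100, 32)
head = nn.Sequential(nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 2))
params = list(embed.parameters()) + list(head.parameters())
optimizer = torch.optim.SGD(params, lr=1e-2)
rho = 0.1  # perturbation radius (hypothetical)

tokens = torch.randint(0, 100, (8,))
labels = torch.randint(0, 2, (8,))

# Inner step: gradient ascent on the embeddings to find worst-case drift.
e = embed(tokens).detach().requires_grad_(True)
inner_loss = nn.functional.cross_entropy(head(e), labels)
grad = torch.autograd.grad(inner_loss, e)[0]
delta = rho * grad / grad.norm().clamp_min(1e-12)

# Outer step: update the model so alignment holds under that drift.
outer_loss = nn.functional.cross_entropy(head(embed(tokens) + delta), labels)
optimizer.zero_grad()
outer_loss.backward()
optimizer.step()
print(f"aligned loss under worst-case drift: {outer_loss.item():.4f}")
```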

Separate the Wheat from the Chaff: A Post-Hoc Approach to Safety Re-Alignment for Fine-Tuned Language Models

D Wu, X Lu, Y Zhao, B Qin - arXiv preprint arXiv:2412.11041, 2024 - arxiv.org
Although large language models (LLMs) achieve effective safety alignment at the time of
release, they still face various safety challenges. A key issue is that fine-tuning often …

SaLoRA: Safety-Alignment Preserved Low-Rank Adaptation

M Li, WM Si, M Backes, Y Zhang, Y Wang - arXiv preprint arXiv …, 2025 - arxiv.org
As advancements in large language models (LLMs) continue and the demand for
personalized models increases, parameter-efficient fine-tuning (PEFT) methods (e.g., LoRA) …
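
For context, a minimal LoRA adapter, plus a hypothetical projector that keeps the low-rank update orthogonal to one designated "safety-critical" direction. Plain LoRA is the standard technique; the projection is only one way to read "safety-alignment preserved" and is not claimed to be SaLoRA's actual construction.

```python
# Minimal sketch: standard LoRA adapter with an optional projector that
# removes a "safety-critical" direction from the low-rank update. The
# projection is an illustrative assumption, not SaLoRA's exact method.
import torch
import torch.nn as nn

torch.manual_seed(0)

class LoRALinear(nn.Module):
    def __init__(self, d_in, d_out, rank=4, alpha=8.0, safety_dir=None):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)
        for p in self.base.parameters():   # frozen "pretrained" weights
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, rank))  # zero init: no drift at start
        self.scale = alpha / rank
        if safety_dir is not None:
            u = safety_dir / safety_dir.norm()
            # Projector onto the orthogonal complement of the safety direction.
            self.register_buffer("proj", torch.eye(d_out) - torch.outer(u, u))
        else:
            self.proj = None

    def forward(self, x):
        delta = (x @ self.A.T) @ self.B.T * self.scale
        if self.proj is not None:
            delta = delta @ self.proj  # keep the update off the safety direction
        return self.base(x) + delta

layer = LoRALinear(16, 16, safety_dir=torch.randn(16))
print(layer(torch.randn(4, 16)).shape)  # torch.Size([4, 16])
```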