Harmful fine-tuning attacks and defenses for large language models: A survey

T Huang, S Hu, F Ilhan, SF Tekin, L Liu - arXiv preprint arXiv:2409.18169, 2024 - arxiv.org
Recent research demonstrates that the nascent fine-tuning-as-a-service business model
exposes serious safety concerns--fine-tuning on a few harmful data points uploaded by users …

[PDF][PDF] Lazy safety alignment for large language models against harmful fine-tuning

T Huang, S Hu, F Ilhan, SF Tekin… - arXiv preprint arXiv …, 2024 - openreview.net
Recent studies show that Large Language Models (LLMs) with safety alignment can be
jailbroken by fine-tuning on a dataset mixed with harmful data. For the first time in the literature, we …

Targeted vaccine: Safety alignment for large language models against harmful fine-tuning via layer-wise perturbation

G Liu, W Lin, T Huang, R Mo, Q Mu, L Shen - arXiv preprint arXiv …, 2024 - arxiv.org
Harmful fine-tuning attacks pose a serious threat to online fine-tuning services. Vaccine, a
recent alignment-stage defense, applies uniform perturbation to all layers of embedding to …

BadJudge: Backdoor Vulnerabilities of LLM-As-A-Judge

T Tong, F Wang, Z Zhao, M Chen - The Thirteenth International …, 2025 - openreview.net
This paper exposes the backdoor threat in automatic evaluation with LLM-as-a-Judge. We
propose a novel threat model, where the adversary assumes control of both the candidate …

Vaccine: Perturbation-aware alignment for large language models against harmful fine-tuning attack

T Huang, S Hu, L Liu - The Thirty-eighth Annual Conference on …, 2024 - openreview.net
The new paradigm of fine-tuning-as-a-service introduces a new attack surface for Large
Language Models (LLMs): a few harmful data uploaded by users can easily trick the fine …

Lisa: Lazy safety alignment for large language models against harmful fine-tuning attack

T Huang, S Hu, F Ilhan, SF Tekin… - The Thirty-eighth Annual …, 2024 - openreview.net
Recent studies show that Large Language Models (LLMs) with safety alignment can be
jailbroken by fine-tuning on a dataset mixed with harmful data. For the first time in the literature …

Virus: Harmful Fine-tuning Attack for Large Language Models Bypassing Guardrail Moderation

T Huang, S Hu, F Ilhan, SF Tekin, L Liu - arXiv preprint arXiv:2501.17433, 2025 - arxiv.org
Recent research shows that Large Language Models (LLMs) are vulnerable to harmful fine-
tuning attacks--models lose their safety alignment ability after fine-tuning on a few harmful …

Pre-trained Graphformer-based Ranking at Web-scale Search

Y Li, H Xiong, L Kong, Z Sun, H Chen, S Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Both Transformer and Graph Neural Networks (GNNs) have been employed in the domain
of learning to rank (LTR). However, these approaches adhere to two distinct yet …

Generative Pre-trained Ranking Model with Over-parameterization at Web-Scale

Y Li, H Xiong, L Kong, J Bian, S Wang, G Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
Learning to rank (LTR) is widely employed in web search to prioritize pertinent webpages
from retrieved content based on input queries. However, traditional LTR models encounter …