Harmful fine-tuning attacks and defenses for large language models: A survey

T Huang, S Hu, F Ilhan, SF Tekin, L Liu - arXiv preprint arXiv:2409.18169, 2024 - arxiv.org
Recent research demonstrates that the nascent fine-tuning-as-a-service business model
exposes serious safety concerns--fine-tuning over a few harmful data uploaded by the users …

Mitigating backdoor threats to large language models: Advancement and challenges

Q Liu, W Mo, T Tong, J Xu, F Wang… - 2024 60th Annual …, 2024 - ieeexplore.ieee.org
The advancement of Large Language Models (LLMs) has significantly impacted various
domains, including Web search, healthcare, and software development. However, as these …

Denial-of-service poisoning attacks against large language models

K Gao, T Pang, C Du, Y Yang, ST Xia, M Lin - arXiv preprint arXiv …, 2024 - arxiv.org
Recent studies have shown that LLMs are vulnerable to denial-of-service (DoS) attacks,
where adversarial inputs like spelling errors or non-semantic prompts trigger endless …

Navigating the risks: A survey of security, privacy, and ethics threats in LLM-based agents

Y Gan, Y Yang, Z Ma, P He, R Zeng, Y Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
With the continuous development of large language models (LLMs), transformer-based
models have made groundbreaking advances in numerous natural language processing …

Safety at Scale: A Comprehensive Survey of Large Model Safety

X Ma, Y Gao, Y Wang, R Wang, X Wang, Y Sun… - arXiv preprint arXiv …, 2025 - arxiv.org
The rapid advancement of large models, driven by their exceptional abilities in learning and
generalization through large-scale pre-training, has reshaped the landscape of Artificial …

When Backdoors Speak: Understanding LLM Backdoor Attacks Through Model-Generated Explanations

H Ge, Y Li, Q Wang, Y Zhang, R Tang - arXiv preprint arXiv:2411.12701, 2024 - arxiv.org
Large Language Models (LLMs) are vulnerable to backdoor attacks, where hidden triggers
can maliciously manipulate model behavior. While several backdoor attack methods have …

Concept-ROT: Poisoning Concepts in Large Language Models with Model Editing

K Grimes, M Christiani, D Shriver, M Connor - arXiv preprint arXiv …, 2024 - arxiv.org
Model editing methods modify specific behaviors of Large Language Models by altering a
small, targeted set of network weights and require very little data and compute. These …

Trading Devil RL: Backdoor attack via Stock market, Bayesian Optimization and Reinforcement Learning

O Mengara - arXiv preprint arXiv:2412.17908, 2024 - arxiv.org
With the rapid development of generative artificial intelligence, particularly large language
models, a number of sub-fields of deep learning have made significant progress and are …

On the Validity of Traditional Vulnerability Scoring Systems for Adversarial Attacks against LLMs

AAM Bahar, AS Wazan - arXiv preprint arXiv:2412.20087, 2024 - arxiv.org
This research investigates the effectiveness of established vulnerability metrics, such as the
Common Vulnerability Scoring System (CVSS), in evaluating attacks against Large …

The TIP of the Iceberg: Revealing a Hidden Class of Task-In-Prompt Adversarial Attacks on LLMs

S Berezin, R Farahbakhsh, N Crespi - arXiv preprint arXiv:2501.18626, 2025 - arxiv.org
We present a novel class of jailbreak adversarial attacks on LLMs, termed Task-in-Prompt
(TIP) attacks. Our approach embeds sequence-to-sequence tasks (e.g., cipher decoding …