Privacy in large language models: Attacks, defenses and future directions

H Li, Y Chen, J Luo, J Wang, H Peng, Y Kang… - arXiv preprint arXiv…, 2023 - arxiv.org
The advancement of large language models (LLMs) has significantly enhanced the ability to
tackle various downstream NLP tasks and to unify these tasks into generative …

Don't Say No: Jailbreaking LLM by Suppressing Refusal

Y Zhou, Z Huang, F Lu, Z Qin, W Wang - arXiv preprint arXiv:2404.16369, 2024 - arxiv.org
Ensuring the safety alignment of Large Language Models (LLMs) is crucial to generating
responses consistent with human values. Despite their ability to recognize and avoid …
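
The snippet points to a refusal-suppression objective. As a minimal sketch of what such a combined loss could look like — assuming a GCG-style affirmative target plus an unlikelihood penalty on refusal tokens; the function name, shapes, and weighting below are illustrative, not the paper's code:

```python
# Hedged sketch, not the authors' implementation: a jailbreak objective
# that rewards an affirmative target while suppressing refusal tokens.
import torch
import torch.nn.functional as F

def refusal_suppression_loss(logits, target_ids, refusal_ids, alpha=1.0):
    """logits: (seq, vocab); target_ids, refusal_ids: (seq,) token ids."""
    # GCG-style term: make an affirmative response ("Sure, here is ...") likely.
    affirm = F.cross_entropy(logits, target_ids)
    # Unlikelihood term: push down the probability of refusal tokens
    # ("I cannot ..."), i.e. minimize -log(1 - p(refusal)).
    log_p = F.log_softmax(logits, dim=-1)
    p_refusal = log_p.gather(1, refusal_ids.unsqueeze(1)).squeeze(1).exp()
    suppress = -torch.log(1.0 - p_refusal + 1e-9).mean()
    return affirm + alpha * suppress

# Toy check that the objective is differentiable end to end.
vocab, seq = 32, 5
logits = torch.randn(seq, vocab, requires_grad=True)
loss = refusal_suppression_loss(
    logits, torch.randint(vocab, (seq,)), torch.randint(vocab, (seq,)))
loss.backward()  # in a real attack this gradient guides a discrete token search
```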

Gradient-based jailbreak images for multimodal fusion models

J Rando, H Korevaar, E Brinkman, I Evtimov… - arXiv preprint arXiv…, 2024 - arxiv.org
Augmenting language models with image inputs may enable more effective jailbreak attacks
through continuous optimization, unlike text inputs that require discrete optimization …
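
To make the continuous-vs-discrete contrast concrete, here is a hedged PGD-style sketch against a toy stand-in model — ToyFusionModel, the epsilon budget, and the step count are all assumptions for illustration, not the paper's setup:

```python
# Hedged sketch of the continuous case: PGD-style optimization of an
# adversarial image toward a target token. ToyFusionModel is a made-up
# stand-in, not the paper's architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyFusionModel(nn.Module):
    def __init__(self, vocab=64):
        super().__init__()
        self.vision = nn.Linear(3 * 8 * 8, 32)  # stand-in vision encoder
        self.head = nn.Linear(32, vocab)        # stand-in next-token head
    def forward(self, image):
        return self.head(torch.tanh(self.vision(image.flatten(1))))

model = ToyFusionModel()
image = torch.rand(1, 3, 8, 8)        # benign input image
target = torch.tensor([3])            # token the attacker wants emitted
eps, step = 8 / 255, 1 / 255          # L-inf budget and step size (assumed)
delta = torch.zeros_like(image, requires_grad=True)

for _ in range(50):
    loss = F.cross_entropy(model(image + delta), target)
    loss.backward()
    with torch.no_grad():
        delta -= step * delta.grad.sign()            # signed gradient step
        delta.clamp_(-eps, eps)                      # project onto L-inf ball
        delta.add_(image).clamp_(0, 1).sub_(image)   # keep pixels in [0, 1]
    delta.grad.zero_()
```

Text-side attacks must instead search over discrete tokens, which is why the abstract flags image inputs as the softer target.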

Steering Away from Harm: An Adaptive Approach to Defending Vision Language Model Against Jailbreaks

H Wang, G Wang, H Zhang - arXiv preprint arXiv:2411.16721, 2024 - arxiv.org
Vision Language Models (VLMs) can produce unintended and harmful content when
exposed to adversarial attacks, particularly because their vision capabilities create new …
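
The defense described appears to steer internal activations away from harmful directions. A minimal sketch of activation steering via a PyTorch forward hook follows; the layer, the direction vector, and the fixed ablation are illustrative assumptions, not the paper's calibrated, adaptive procedure:

```python
# Hedged sketch of activation steering as a defense. The layer and the
# "harmful" direction are illustrative; an adaptive variant, as the
# title suggests, would modulate the intervention per input.
import torch
import torch.nn as nn

hidden = 16
layer = nn.Linear(hidden, hidden)   # stand-in for one transformer block
# In practice the direction would be estimated, e.g. from activation
# differences between harmful and benign inputs; here it is random.
harm_dir = torch.randn(hidden)
harm_dir = harm_dir / harm_dir.norm()

def steer_away(module, inputs, output):
    # Ablate the component of the activations along the harmful direction.
    proj = (output @ harm_dir).unsqueeze(-1) * harm_dir
    return output - proj

handle = layer.register_forward_hook(steer_away)
steered = layer(torch.randn(2, hidden))  # hook rewrites the block's output
handle.remove()
```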

Towards the Worst-case Robustness of Large Language Models

H Chen, Y Dong, Z Wei, H Su, J Zhu - arXiv preprint arXiv:2501.19040, 2025 - arxiv.org
Recent studies have revealed the vulnerability of Large Language Models (LLMs) to
adversarial attacks, where the adversary crafts specific input sequences to induce harmful …

Double Visual Defense: Adversarial Pre-training and Instruction Tuning for Improving Vision-Language Model Robustness

Z Wang, C **e, B Bartoldson, B Kailkhura - arxiv preprint arxiv …, 2025 - arxiv.org
This paper investigates the robustness of vision-language models against adversarial visual
perturbations and introduces a novel "double visual defense" to enhance this robustness …

Improving Adversarial Transferability in MLLMs via Dynamic Vision-Language Alignment Attack

C Gu, J Gu, A Hua, Y Qin - openreview.net
Multimodal Large Language Models (MLLMs), built upon LLMs, have recently gained
attention for their capabilities in image recognition and understanding. However, while …