Latent adversarial training improves robustness to persistent harmful behaviors in LLMs

A Sheshadri, A Ewart, P Guo, A Lynch, C Wu… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) can often be made to behave in undesirable ways that they
are explicitly fine-tuned not to. For example, the LLM red-teaming literature has produced a …
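
To make the idea behind this entry concrete: latent adversarial training (LAT) perturbs hidden activations rather than raw inputs. Below is a minimal PyTorch sketch under stated assumptions; the `encoder`/`head` split, the L2 perturbation ball, and all hyperparameters are illustrative, not the paper's exact recipe.

```python
# Minimal sketch of latent adversarial training (LAT): optimize a bounded
# perturbation on hidden activations instead of on the input.
# `encoder`/`head` and all hyperparameters are illustrative assumptions.
import torch
import torch.nn.functional as F

def latent_adversarial_loss(encoder, head, x, y, eps=1.0, alpha=0.25, steps=6):
    h = encoder(x)                                   # latent activations
    delta = torch.zeros_like(h, requires_grad=True)  # latent perturbation
    for _ in range(steps):
        # attack loop: gradients flow only into delta, not the encoder
        loss = F.cross_entropy(head(h.detach() + delta), y)
        grad = torch.autograd.grad(loss, delta)[0]
        with torch.no_grad():
            delta += alpha * grad / (grad.norm() + 1e-8)            # ascent step
            delta *= (eps / (delta.norm() + 1e-8)).clamp(max=1.0)   # L2 projection
    # outer loss: train encoder and head against the worst-case latent found above
    return F.cross_entropy(head(h + delta.detach()), y)
```

In a training loop this loss would replace (or be mixed with) the clean cross-entropy term; the norms here are taken over the whole batch for brevity, where a per-example norm would be more faithful.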

Trading Inference-Time Compute for Adversarial Robustness

W Zaremba, E Nitishinskaya, B Barak, S Lin… - arXiv preprint arXiv …, 2025 - arxiv.org
We conduct experiments on the impact of increasing inference-time compute in reasoning
models (specifically OpenAI o1-preview and o1-mini) on their robustness to adversarial …

Adversarial Training: A Survey

M Zhao, L Zhang, J Ye, H Lu, B Yin, X Wang - arXiv preprint arXiv …, 2024 - arxiv.org
Adversarial training (AT) refers to integrating adversarial examples--inputs altered with
imperceptible perturbations that can significantly impact model predictions--into the training …
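
The snippet's definition maps directly onto the canonical inner-maximization/outer-minimization loop. A minimal PyTorch sketch of PGD-based adversarial training in the style of Madry et al. follows; step sizes, budgets, and the [0, 1] input range are illustrative assumptions.

```python
# Minimal sketch of PGD-based adversarial training (Madry et al. style).
# Assumes a generic classifier and inputs in [0, 1]; hyperparameters are
# illustrative, not taken from the survey.
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """Craft L-infinity-bounded adversarial examples via projected gradient descent."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        # ascend the loss, then project back into the eps-ball around x
        x_adv = (x_adv.detach() + alpha * grad.sign()).clamp(x - eps, x + eps).clamp(0, 1)
    return x_adv.detach()

def adversarial_training_step(model, optimizer, x, y):
    """One AT step: train on adversarial examples instead of clean inputs."""
    model.eval()                      # freeze batch-norm stats while attacking
    x_adv = pgd_attack(model, x, y)
    model.train()
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```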

Comparative Study of Adversarial Defenses: Adversarial Training and Regularization in Vision Transformers and CNNs

H Dingeto, J Kim - Electronics, 2024 - mdpi.com
Transformer-based models are currently driving a significant revolution in machine
learning. Among these innovations, vision transformers (ViTs) stand out for …

Robust-LLaVA: On the Effectiveness of Large-Scale Robust Image Encoders for Multi-modal Large Language Models

HS Malik, F Shamshad, M Naseer… - arXiv preprint arXiv …, 2025 - arxiv.org
Multi-modal Large Language Models (MLLMs) excel in vision-language tasks but remain
vulnerable to visual adversarial perturbations that can induce hallucinations, manipulate …

Double Visual Defense: Adversarial Pre-training and Instruction Tuning for Improving Vision-Language Model Robustness

Z Wang, C Xie, B Bartoldson, B Kailkhura - arXiv preprint arXiv …, 2025 - arxiv.org
This paper investigates the robustness of vision-language models against adversarial visual
perturbations and introduces a novel "double visual defense" to enhance this robustness …

TAPT: Test-Time Adversarial Prompt Tuning for Robust Inference in Vision-Language Models

X Wang, K Chen, J Zhang, J Chen, X Ma - arXiv preprint arXiv:2411.13136, 2024 - arxiv.org
Large pre-trained Vision-Language Models (VLMs) such as CLIP have demonstrated
excellent zero-shot generalizability across various downstream tasks. However, recent …

MIMIR: Masked Image Modeling for Mutual Information-based Adversarial Robustness

X Xu, S Yu, Z Liu, S Picek - arXiv preprint arXiv:2312.04960, 2023 - arxiv.org
Vision Transformers (ViTs) achieve excellent performance in various tasks, but they are also
vulnerable to adversarial attacks. Building robust ViTs is highly dependent on dedicated …

Deep Learning for Robust Facial Expression Recognition: A Resilient Defense Against Adversarial Attacks

Adversarial attacks can be extremely dangerous, particularly in scenarios where the
precision of facial expression identification is of utmost importance. Employing adversarial …