Constitutional classifiers: Defending against universal jailbreaks across thousands of hours of red teaming

M Sharma, M Tong, J Mu, J Wei, J Kruthoff… - arXiv preprint arXiv …, 2025 - arxiv.org
Large language models (LLMs) are vulnerable to universal jailbreaks: prompting strategies
that systematically bypass model safeguards and enable users to carry out harmful …

Adversarial ML Problems Are Getting Harder to Solve and to Evaluate

J Rando, J Zhang, N Carlini, F Tramèr - arXiv preprint arXiv:2502.02260, 2025 - arxiv.org
In the past decade, considerable research effort has been devoted to securing machine
learning (ML) models that operate in adversarial settings. Yet, progress has been slow even …

Multi-Modal One-Shot Federated Ensemble Learning for Medical Data with Vision Large Language Model

N Wang, Y Deng, S Fan, J Yin, SK Ng - arXiv preprint arXiv:2501.03292, 2025 - arxiv.org
Federated learning (FL) has attracted considerable interest in the medical domain due to its
capacity to facilitate collaborative model training while maintaining data privacy. However …

MSTS: A Multimodal Safety Test Suite for Vision-Language Models

P Röttger, G Attanasio, F Friedrich, J Goldzycher… - arXiv preprint arXiv …, 2025 - arxiv.org
Vision-language models (VLMs), which process image and text inputs, are increasingly
integrated into chat assistants and other consumer AI applications. Without proper …

Peering Behind the Shield: Guardrail Identification in Large Language Models

Z Yang, Y Wu, R Wen, M Backes, Y Zhang - arXiv preprint arXiv …, 2025 - arxiv.org
Human-AI conversations have gained increasing attention since the advent of large language
models. Consequently, more techniques, such as input/output guardrails and safety …

Towards Efficient Large Multimodal Model Serving

H Qiu, A Biswas, Z Zhao, J Mohan, A Khare… - arXiv preprint arXiv …, 2025 - arxiv.org
Recent advances in generative AI have led to large multi-modal models (LMMs) capable of
simultaneously processing inputs of various modalities such as text, images, video, and …

ELITE: Enhanced Language-Image Toxicity Evaluation for Safety

W Lee, D Lee, E Choi, S Yu, A Yousefpour… - arXiv preprint arXiv …, 2025 - arxiv.org
Current Vision Language Models (VLMs) remain vulnerable to malicious prompts that
induce harmful outputs. Existing safety benchmarks for VLMs primarily rely on automated …

Gradient Co-occurrence Analysis for Detecting Unsafe Prompts in Large Language Models

J Yang, B Yan, R Li, Z Zhou, X Chen, Z Feng… - arXiv preprint arXiv …, 2025 - arxiv.org
Unsafe prompts pose significant safety risks to large language models (LLMs). Existing
methods for detecting unsafe prompts rely on data-driven fine-tuning to train guardrail …

Universal Adversarial Attack on Aligned Multimodal LLMs

T Rahmatullaev, P Druzhinina, M Mikhalchuk… - arXiv preprint arXiv …, 2025 - arxiv.org
We propose a universal adversarial attack on multimodal large language models (LLMs)
that leverages a single optimized image to override alignment safeguards across diverse …

FLAME: Flexible LLM-Assisted Moderation Engine

I Bakulin, I Kopanichuk, I Bespalov… - arXiv preprint arXiv …, 2025 - arxiv.org
The rapid advancement of Large Language Models (LLMs) has introduced significant
challenges in moderating user-model interactions. While LLMs demonstrate remarkable …