COLD-Attack: Jailbreaking LLMs with stealthiness and controllability

X Guo, F Yu, H Zhang, L Qin, B Hu - arXiv preprint arXiv:2402.08679, 2024 - arxiv.org
Jailbreaks on large language models (LLMs) have recently received increasing attention.
For a comprehensive assessment of LLM safety, it is essential to consider jailbreaks with …

Prompting4Debugging: Red-teaming text-to-image diffusion models by finding problematic prompts

ZY Chin, CM Jiang, CC Huang, PY Chen… - arXiv preprint arXiv …, 2023 - arxiv.org
Text-to-image diffusion models, e.g., Stable Diffusion (SD), have lately shown remarkable
ability in high-quality content generation, and have become one of the representatives for the …

Against The Achilles' Heel: A Survey on Red Teaming for Generative Models

L Lin, H Mu, Z Zhai, M Wang, Y Wang, R Wang… - Journal of Artificial …, 2025 - jair.org
Generative models are rapidly gaining popularity and being integrated into everyday
applications, raising concerns over their safe use as various vulnerabilities are exposed. In …

FLIRT: Feedback loop in-context red teaming

N Mehrabi, P Goyal, C Dupuy, Q Hu, S Ghosh… - arXiv preprint arXiv …, 2023 - arxiv.org
Warning: this paper contains content that may be inappropriate or offensive. As generative
models become available for public use in various applications, testing and analyzing …

Trustworthy, responsible, and safe AI: A comprehensive architectural framework for AI safety with challenges and mitigations

C Chen, X Gong, Z Liu, W Jiang, SQ Goh… - arXiv preprint arXiv …, 2024 - arxiv.org
AI Safety is an emerging area of critical importance to the safe adoption and deployment of
AI systems. With the rapid proliferation of AI and especially with the recent advancement of …

Exploring safety-utility trade-offs in personalized language models

AR Vijjini, SBR Chowdhury, S Chaturvedi - arXiv preprint arXiv …, 2024 - arxiv.org
As large language models (LLMs) become increasingly integrated into daily applications, it
is essential to ensure they operate fairly across diverse user demographics. In this work, we …

Diverse and Effective Red Teaming with Auto-generated Rewards and Multi-step Reinforcement Learning

A Beutel, K Xiao, J Heidecke, L Weng - arXiv preprint arXiv:2412.18693, 2024 - arxiv.org
Automated red teaming can discover rare model failures and generate challenging
examples that can be used for training or evaluation. However, a core challenge in …

ASETF: A novel method for jailbreak attack on LLMs through translate suffix embeddings

H Wang, H Li, M Huang, L Sha - arXiv preprint arXiv:2402.16006, 2024 - arxiv.org
The safety defenses of large language models (LLMs) remain limited because dangerous
prompts are manually curated to cover only a few known attack types, which fails to keep …

Impact of non-standard Unicode characters on security and comprehension in large language models

JS Daniel, A Pal - arXiv preprint arXiv:2405.14490, 2024 - arxiv.org
The advancement of large language models has significantly improved natural language
processing. However, challenges such as jailbreaks (prompt injections that cause an LLM to …

Text-Diffusion Red-Teaming of Large Language Models: Unveiling Harmful Behaviors with Proximity Constraints

J Nöther, A Singla, G Radanović - arXiv preprint arXiv:2501.08246, 2025 - arxiv.org
Recent work has proposed automated red-teaming methods for testing the vulnerabilities of
a given target large language model (LLM). These methods use red-teaming LLMs to …