Jailbreak and guard aligned language models with only few in-context demonstrations

Z Wei, Y Wang, A Li, Y Mo, Y Wang - arXiv preprint arXiv:2310.06387, 2023 - arxiv.org
Large Language Models (LLMs) have shown remarkable success in various tasks, yet their
safety and the risk of generating harmful content remain pressing concerns. In this paper, we …

Jailbreaking large language models against moderation guardrails via cipher characters

H Jin, A Zhou, J Menke, H Wang - Advances in Neural …, 2025 - proceedings.neurips.cc
Large Language Models (LLMs) are typically harmless but remain vulnerable to
carefully crafted prompts known as "jailbreaks", which can bypass protective measures and …

Jailbreakzoo: Survey, landscapes, and horizons in jailbreaking large language and vision-language models

H Jin, L Hu, X Li, P Zhang, C Chen, J Zhuang… - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid evolution of artificial intelligence (AI) through developments in Large Language
Models (LLMs) and Vision-Language Models (VLMs) has brought significant advancements …

PsySafe: A comprehensive framework for psychological-based attack, defense, and evaluation of multi-agent system safety

Z Zhang, Y Zhang, L Li, H Gao, L Wang, H Lu… - arXiv preprint arXiv …, 2024 - arxiv.org
Multi-agent systems, when enhanced with Large Language Models (LLMs), exhibit profound
capabilities in collective intelligence. However, the potential misuse of this intelligence for …

Towards tracing trustworthiness dynamics: Revisiting pre-training period of large language models

C Qian, J Zhang, W Yao, D Liu, Z Yin, Y Qiao… - arXiv preprint arXiv …, 2024 - arxiv.org
Ensuring the trustworthiness of large language models (LLMs) is crucial. Most studies
concentrate on fully pre-trained LLMs to better understand and improve LLMs' …

Identifying semantic induction heads to understand in-context learning

J Ren, Q Guo, H Yan, D Liu, Q Zhang, X Qiu… - arXiv preprint arXiv …, 2024 - arxiv.org
Although large language models (LLMs) have demonstrated remarkable performance, the
lack of transparency in their inference logic raises concerns about their trustworthiness. To …

Adversarial tuning: Defending against jailbreak attacks for LLMs

F Liu, Z Xu, H Liu - arXiv preprint arXiv:2406.06622, 2024 - arxiv.org
Although safety-enhanced Large Language Models (LLMs) have achieved remarkable
success in tackling various complex tasks in a zero-shot manner, they remain susceptible to …

Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-discovered Clues

Q Ren, H Li, D Liu, Z Xie, X Lu, Y Qiao, L Sha… - arXiv preprint arXiv …, 2024 - arxiv.org
This study exposes the safety vulnerabilities of Large Language Models (LLMs) in multi-turn
interactions, where malicious users can obscure harmful intents across several queries. We …

Contextual API completion for unseen repositories using LLMs

N Nashid, T Shabani, P Alian, A Mesbah - arXiv preprint arXiv:2405.04600, 2024 - arxiv.org
Large language models have made substantial progress in addressing diverse code-related
tasks. However, their adoption is hindered by inconsistencies in generating output due to the …

VLSBench: Unveiling visual leakage in multimodal safety

X Hu, D Liu, H Li, X Huang, J Shao - arXiv preprint arXiv:2411.19939, 2024 - arxiv.org
Safety concerns of Multimodal Large Language Models (MLLMs) have gradually become an
important problem in various applications. Surprisingly, previous works indicate a counter …