AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents

M Andriushchenko, A Souly, M Dziemian… - arXiv preprint arXiv…, 2024 - arxiv.org
The robustness of LLMs to jailbreak attacks, where users design prompts to circumvent
safety measures and misuse model capabilities, has been studied primarily for LLMs acting …

Jigsaw Puzzles: Splitting Harmful Questions to Jailbreak Large Language Models

H Yang, L Qu, E Shareghi, G Haffari - arXiv preprint arXiv:2410.11459, 2024 - arxiv.org
Large language models (LLMs) have exhibited outstanding performance in engaging with
humans and addressing complex questions by leveraging their vast implicit knowledge and …

You Know What I'm Saying: Jailbreak Attack via Implicit Reference

T Wu, L Mei, R Yuan, L Li, W Xue, Y Guo - arXiv preprint arXiv:2410.03857, 2024 - arxiv.org
While recent advancements in large language model (LLM) alignment have enabled the
effective identification of malicious objectives involving scene nesting and keyword rewriting …