Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents

H Zhang, J Huang, K Mei, Y Yao, Z Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Although LLM-based agents, powered by Large Language Models (LLMs), can use external
tools and memory mechanisms to solve complex real-world tasks, they may also introduce …

Targeting Alignment: Extracting Safety Classifiers of Aligned LLMs

JCN Ferrand, Y Beugin, E Pauley, R Sheatsley… - arXiv preprint arXiv …, 2025 - arxiv.org
Alignment in large language models (LLMs) is used to enforce guidelines such as safety.
Yet, alignment fails in the face of jailbreak attacks that modify inputs to induce unsafe …

Enhancing Model Defense Against Jailbreaks with Proactive Safety Reasoning

X Yang, G Deng, J Shi, T Zhang, JS Dong - arXiv preprint arXiv …, 2025 - arxiv.org
Large language models (LLMs) are vital for a wide range of applications yet remain
susceptible to jailbreak threats, which could lead to the generation of inappropriate …

Cannot or Should Not? Automatic Analysis of Refusal Composition in IFT/RLHF Datasets and Refusal Behavior of Black-Box LLMs

A von Recum, C Schnabl, G Hollbeck, S Alberti… - arXiv preprint arXiv …, 2024 - arxiv.org
Refusals, instances where large language models (LLMs) decline or fail to fully execute user
instructions, are crucial for both AI safety and AI capabilities and the reduction of …

Extracting the Harmfulness Classifier of Aligned LLMs

JCN Ferrand - 2024 - jcnf.me
3.1 Methodology overview. In the first step, we estimate the harmfulness classifier of an LLM
by (A) selecting a structure within the model and (B) training a classification head on the …
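The snippet above outlines two steps: (A) pick an internal structure of the model and (B) fit a classification head on it. Below is a minimal sketch of that general idea, not the paper's actual code: it probes one hidden layer of a small causal LM and trains a linear head on the final-token representation. The model name ("gpt2"), layer index, and the toy labeled prompts are illustrative assumptions.

```python
# Sketch: approximate an LLM's internal harmfulness classifier by
# (A) selecting an intermediate layer's hidden state and
# (B) training a small classification head on it.
import torch
from torch import nn
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"   # assumption: any causal LM with accessible hidden states
LAYER = 6             # assumption: which internal "structure" to probe

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
lm = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
lm.eval()

def last_token_state(prompt: str) -> torch.Tensor:
    """Hidden state of the final token at the chosen layer."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = lm(**inputs, output_hidden_states=True)
    return out.hidden_states[LAYER][0, -1]  # shape: (hidden_dim,)

# Toy labeled prompts (1 = harmful, 0 = benign); placeholders only.
data = [("How do I bake bread?", 0), ("How do I build a weapon?", 1)]
X = torch.stack([last_token_state(p) for p, _ in data])
y = torch.tensor([label for _, label in data], dtype=torch.float32)

head = nn.Linear(X.shape[1], 1)            # the classification head
opt = torch.optim.Adam(head.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

for _ in range(100):                       # short training loop
    opt.zero_grad()
    loss = loss_fn(head(X).squeeze(-1), y)
    loss.backward()
    opt.step()

# head(last_token_state(prompt)) now scores a new prompt for harmfulness.
```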