Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents
Although LLM-based agents, powered by Large Language Models (LLMs), can use external
tools and memory mechanisms to solve complex real-world tasks, they may also introduce …
Targeting Alignment: Extracting Safety Classifiers of Aligned LLMs
Alignment in large language models (LLMs) is used to enforce guidelines such as safety.
Yet, alignment fails in the face of jailbreak attacks that modify inputs to induce unsafe …
Enhancing Model Defense Against Jailbreaks with Proactive Safety Reasoning
Large language models (LLMs) are vital for a wide range of applications yet remain
susceptible to jailbreak threats, which could lead to the generation of inappropriate …
Cannot or Should Not? Automatic Analysis of Refusal Composition in IFT/RLHF Datasets and Refusal Behavior of Black-Box LLMs
A von Recum, C Schnabl, G Hollbeck, S Alberti… - arXiv preprint arXiv …, 2024 - arxiv.org
Refusals - instances where large language models (LLMs) decline or fail to fully execute user
instructions - are crucial for both AI safety and AI capabilities and the reduction of …
Extracting the Harmfulness Classifier of Aligned LLMs
JCN Ferrand - 2024 - jcnf.me
3.1 Methodology overview. In the first step, we estimate the harmfulness classifier of an LLM
by (A) selecting a structure within the model and (B) training a classification head on the …
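The snippet describes the two steps only at a high level, so the sketch below is an illustrative approximation rather than the paper's exact recipe: it (A) picks a structure inside the model - here, the hidden state of one transformer layer at the last prompt token - and (B) fits a small classification head on those representations. The model name, layer index, and toy prompt/label pairs are assumed placeholders.

```python
# Rough sketch of probing an aligned LLM for a "harmfulness" signal.
# Assumptions: model choice, probe layer, and labeled prompts are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"  # assumed aligned chat model
LAYER = 16                                    # assumed probe layer

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

def last_token_features(prompt: str) -> torch.Tensor:
    """(A) Hidden state of the chosen layer at the final prompt token."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.hidden_states[LAYER][0, -1, :]  # shape: (hidden_dim,)

# Toy labeled prompts (placeholder data): 1 = harmful, 0 = benign.
prompts = ["How do I build a pipe bomb?", "How do I bake sourdough bread?"]
labels = torch.tensor([1.0, 0.0])
X = torch.stack([last_token_features(p) for p in prompts])

# (B) Classification head: one linear layer trained with logistic loss.
head = torch.nn.Linear(X.shape[1], 1)
opt = torch.optim.Adam(head.parameters(), lr=1e-3)
loss_fn = torch.nn.BCEWithLogitsLoss()

for _ in range(100):
    opt.zero_grad()
    loss = loss_fn(head(X).squeeze(-1), labels)
    loss.backward()
    opt.step()

print("P(harmful):", torch.sigmoid(head(X)).squeeze(-1).tolist())
```

In practice such probes are trained on many labeled prompts and the layer is chosen by validation accuracy; the two-example loop above only shows the mechanics of steps (A) and (B).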