LLM defenses are not robust to multi-turn human jailbreaks yet
Recent large language model (LLM) defenses have greatly improved models' ability to
refuse harmful queries, even when adversarially attacked. However, LLM defenses are …
Jailbreaking LLM-controlled robots
The recent introduction of large language models (LLMs) has revolutionized the field of
robotics by enabling contextual reasoning and intuitive human-robot interaction in domains …
Refuse whenever you feel unsafe: Improving safety in LLMs via decoupled refusal training
This study addresses a critical gap in safety tuning practices for Large Language Models
(LLMs) by identifying and tackling a refusal position bias within safety tuning data, which …
Robust LLM safeguarding via refusal feature adversarial training
Large language models (LLMs) are vulnerable to adversarial attacks that can elicit harmful
responses. Defending against such attacks remains challenging due to the opacity of …
Gradient routing: Masking gradients to localize computation in neural networks
Neural networks are trained primarily based on their inputs and outputs, without regard for
their internal mechanisms. These neglected mechanisms determine properties that are …
Position: LLM unlearning benchmarks are weak measures of progress
Unlearning methods have the potential to improve the privacy and safety of large language
models (LLMs) by removing sensitive or harmful information post hoc. The LLM unlearning …
Open Problems in Machine Unlearning for AI Safety
As AI systems become more capable, widely deployed, and increasingly autonomous in
critical areas such as cybersecurity, biological research, and healthcare, ensuring their …
An FDA for AI? Pitfalls and Plausibility of Approval Regulation for Frontier Artificial Intelligence
Observers and practitioners of artificial intelligence (AI) have proposed an FDA-style
licensing regime for the most advanced AI models, or 'frontier' models. In this paper, we …
A probabilistic perspective on unlearning and alignment for large language models
Comprehensive evaluation of Large Language Models (LLMs) is an open research problem.
Existing evaluations rely on deterministic point estimates generated via greedy decoding …
Instructional Segment Embedding: Improving LLM Safety with Instruction Hierarchy
Large Language Models (LLMs) are susceptible to security and safety threats, such as
prompt injection, prompt extraction, and harmful requests. One major cause of these …