LLM defenses are not robust to multi-turn human jailbreaks yet

N Li, Z Han, I Steneker, W Primack, R Goodside… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent large language model (LLM) defenses have greatly improved models' ability to
refuse harmful queries, even when adversarially attacked. However, LLM defenses are …

Jailbreaking LLM-controlled robots

A Robey, Z Ravichandran, V Kumar, H Hassani… - arXiv preprint arXiv …, 2024 - arxiv.org
The recent introduction of large language models (LLMs) has revolutionized the field of
robotics by enabling contextual reasoning and intuitive human-robot interaction in domains …

Refuse whenever you feel unsafe: Improving safety in LLMs via decoupled refusal training

Y Yuan, W Jiao, W Wang, J Huang, J Xu… - arXiv preprint arXiv …, 2024 - arxiv.org
This study addresses a critical gap in safety tuning practices for Large Language Models
(LLMs) by identifying and tackling a refusal position bias within safety tuning data, which …

Robust LLM safeguarding via refusal feature adversarial training

L Yu, V Do, K Hambardzumyan… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) are vulnerable to adversarial attacks that can elicit harmful
responses. Defending against such attacks remains challenging due to the opacity of …

Gradient routing: Masking gradients to localize computation in neural networks

A Cloud, J Goldman-Wetzler, E Wybitul, J Miller… - arXiv preprint arXiv …, 2024 - arxiv.org
Neural networks are trained primarily based on their inputs and outputs, without regard for
their internal mechanisms. These neglected mechanisms determine properties that are …

Position: LLM unlearning benchmarks are weak measures of progress

P Thaker, S Hu, N Kale, Y Maurya, ZS Wu… - arXiv preprint arXiv …, 2024 - arxiv.org
Unlearning methods have the potential to improve the privacy and safety of large language
models (LLMs) by removing sensitive or harmful information post hoc. The LLM unlearning …

Open Problems in Machine Unlearning for AI Safety

F Barez, T Fu, A Prabhu, S Casper, A Sanyal… - arXiv preprint arXiv …, 2025 - arxiv.org
As AI systems become more capable, widely deployed, and increasingly autonomous in
critical areas such as cybersecurity, biological research, and healthcare, ensuring their …

An FDA for AI? Pitfalls and Plausibility of Approval Regulation for Frontier Artificial Intelligence

D Carpenter, C Ezell - Proceedings of the AAAI/ACM Conference on AI …, 2024 - ojs.aaai.org
Observers and practitioners of artificial intelligence (AI) have proposed an FDA-style
licensing regime for the most advanced AI models, or'frontier'models. In this paper, we …

A probabilistic perspective on unlearning and alignment for large language models

Y Scholten, S Günnemann, L Schwinn - arXiv preprint arXiv:2410.03523, 2024 - arxiv.org
Comprehensive evaluation of Large Language Models (LLMs) remains an open research problem.
Existing evaluations rely on deterministic point estimates generated via greedy decoding …

Instructional Segment Embedding: Improving LLM Safety with Instruction Hierarchy

T Wu, S Zhang, K Song, S Xu, S Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) are susceptible to security and safety threats, such as
prompt injection, prompt extraction, and harmful requests. One major cause of these …