LLM defenses are not robust to multi-turn human jailbreaks yet

N Li, Z Han, I Steneker, W Primack, R Goodside… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent large language model (LLM) defenses have greatly improved models' ability to
refuse harmful queries, even when adversarially attacked. However, LLM defenses are …

Jailbreaking LLM-controlled robots

A Robey, Z Ravichandran, V Kumar, H Hassani… - arXiv preprint arXiv …, 2024 - arxiv.org
The recent introduction of large language models (LLMs) has revolutionized the field of
robotics by enabling contextual reasoning and intuitive human-robot interaction in domains …

Refuse whenever you feel unsafe: Improving safety in LLMs via decoupled refusal training

Y Yuan, W Jiao, W Wang, J Huang, J Xu… - arXiv preprint arXiv …, 2024 - arxiv.org
This study addresses a critical gap in safety tuning practices for Large Language Models
(LLMs) by identifying and tackling a refusal position bias within safety tuning data, which …

Robust LLM safeguarding via refusal feature adversarial training

L Yu, V Do, K Hambardzumyan… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) are vulnerable to adversarial attacks that can elicit harmful
responses. Defending against such attacks remains challenging due to the opacity of …

Gradient routing: Masking gradients to localize computation in neural networks

A Cloud, J Goldman-Wetzler, E Wybitul, J Miller… - arXiv preprint arXiv …, 2024 - arxiv.org
Neural networks are trained primarily based on their inputs and outputs, without regard for
their internal mechanisms. These neglected mechanisms determine properties that are …

Position: LLM unlearning benchmarks are weak measures of progress

P Thaker, S Hu, N Kale, Y Maurya, ZS Wu… - arXiv preprint arXiv …, 2024 - arxiv.org
Unlearning methods have the potential to improve the privacy and safety of large language
models (LLMs) by removing sensitive or harmful information post hoc. The LLM unlearning …

Open Problems in Machine Unlearning for AI Safety

F Barez, T Fu, A Prabhu, S Casper, A Sanyal… - arXiv preprint arXiv …, 2025 - arxiv.org
As AI systems become more capable, widely deployed, and increasingly autonomous in
critical areas such as cybersecurity, biological research, and healthcare, ensuring their …

An FDA for AI? Pitfalls and Plausibility of Approval Regulation for Frontier Artificial Intelligence

D Carpenter, C Ezell - Proceedings of the AAAI/ACM Conference on AI …, 2024 - ojs.aaai.org
Observers and practitioners of artificial intelligence (AI) have proposed an FDA-style
licensing regime for the most advanced AI models, or'frontier'models. In this paper, we …

A probabilistic perspective on unlearning and alignment for large language models

Y Scholten, S Günnemann, L Schwinn - arXiv preprint arXiv:2410.03523, 2024 - arxiv.org
Comprehensive evaluation of Large Language Models (LLMs) remains an open research problem.
Existing evaluations rely on deterministic point estimates generated via greedy decoding …

Instructional Segment Embedding: Improving LLM Safety with Instruction Hierarchy

T Wu, S Zhang, K Song, S Xu, S Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) are susceptible to security and safety threats, such as
prompt injection, prompt extraction, and harmful requests. One major cause of these …