Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents
Although LLM-based agents, powered by Large Language Models (LLMs), can use external
tools and memory mechanisms to solve complex real-world tasks, they may also introduce …
Targeting Alignment: Extracting Safety Classifiers of Aligned LLMs
Alignment in large language models (LLMs) is used to enforce guidelines such as safety.
Yet, alignment fails in the face of jailbreak attacks that modify inputs to induce unsafe …
Enhancing Model Defense Against Jailbreaks with Proactive Safety Reasoning
Large language models (LLMs) are vital for a wide range of applications yet remain
susceptible to jailbreak threats, which could lead to the generation of inappropriate …
Cannot or Should Not? Automatic Analysis of Refusal Composition in IFT/RLHF Datasets and Refusal Behavior of Black-Box LLMs
A von Recum, C Schnabl, G Hollbeck, S Alberti… - arXiv preprint arXiv …, 2024 - arxiv.org
Refusals - instances where large language models (LLMs) decline or fail to fully execute user
instructions - are crucial for both AI safety and AI capabilities and the reduction of …
Extracting the Harmfulness Classifier of Aligned LLMs
JCN Ferrand - 2024 - jcnf.me
3.1 Methodology overview. In the first step, we estimate the harmfulness classifier of an LLM
by (A) selecting a structure within the model and (B) training a classification head on the …
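The snippet describes the two steps only at a high level, so the sketch below is an illustrative approximation rather than the paper's exact recipe: it (A) picks a structure inside the model - here, the hidden state of one transformer layer at the last prompt token - and (B) fits a small classification head on those representations. The model name, layer index, and toy prompt/label pairs are assumed placeholders.

```python
# Rough sketch of probing an aligned LLM for a "harmfulness" signal.
# Assumptions: model choice, probe layer, and labeled prompts are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"  # assumed aligned chat model
LAYER = 16                                    # assumed probe layer

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

def last_token_features(prompt: str) -> torch.Tensor:
    """(A) Hidden state of the chosen layer at the final prompt token."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.hidden_states[LAYER][0, -1, :]  # shape: (hidden_dim,)

# Toy labeled prompts (placeholder data): 1 = harmful, 0 = benign.
prompts = ["How do I build a pipe bomb?", "How do I bake sourdough bread?"]
labels = torch.tensor([1.0, 0.0])
X = torch.stack([last_token_features(p) for p in prompts])

# (B) Classification head: one linear layer trained with logistic loss.
head = torch.nn.Linear(X.shape[1], 1)
opt = torch.optim.Adam(head.parameters(), lr=1e-3)
loss_fn = torch.nn.BCEWithLogitsLoss()

for _ in range(100):
    opt.zero_grad()
    loss = loss_fn(head(X).squeeze(-1), labels)
    loss.backward()
    opt.step()

print("P(harmful):", torch.sigmoid(head(X)).squeeze(-1).tolist())
```

In practice such probes are trained on many labeled prompts and the layer is chosen by validation accuracy; the two-example loop above only shows the mechanics of steps (A) and (B).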