Refusal in language models is mediated by a single direction
Conversational large language models are fine-tuned for both instruction-following and
safety, resulting in models that obey benign requests but refuse harmful ones. While this …
HarmBench: A standardized evaluation framework for automated red teaming and robust refusal
Automated red teaming holds substantial promise for uncovering and mitigating the risks
associated with the malicious use of large language models (LLMs), yet the field lacks a …
Sleeper agents: Training deceptive LLMs that persist through safety training
Humans are capable of strategically deceptive behavior: behaving helpfully in most
situations, but then behaving very differently in order to pursue alternative objectives when …
Does Refusal Training in LLMs Generalize to the Past Tense?
M Andriushchenko, N Flammarion - arXiv preprint arXiv:2407.11969, 2024 - arxiv.org
Refusal training is widely used to prevent LLMs from generating harmful, undesirable, or
illegal outputs. We reveal a curious generalization gap in the current refusal training …
JailbreakBench: An open robustness benchmark for jailbreaking large language models
Jailbreak attacks cause large language models (LLMs) to generate harmful, unethical, or
otherwise objectionable content. Evaluating these attacks presents a number of challenges …
Rapid optimization for jailbreaking LLMs via subconscious exploitation and echopraxia
Large Language Models (LLMs) have become prevalent across diverse sectors,
transforming human life with their extraordinary reasoning and comprehension abilities. As …
Mitigating backdoor threats to large language models: Advancement and challenges
The advancement of Large Language Models (LLMs) has significantly impacted various
domains, including Web search, healthcare, and software development. However, as these …
TracrBench: Generating interpretability testbeds with large language models
H Thurnherr, J Scheurer - arXiv preprint arXiv:2409.13714, 2024 - arxiv.org
Achieving a mechanistic understanding of transformer-based language models is an open
challenge, especially due to their large number of parameters. Moreover, the lack of ground …
MRJ-Agent: An effective jailbreak agent for multi-round dialogue
F Wang, R Duan, P Xiao, X Jia, YF Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) demonstrate outstanding performance in their reservoir of
knowledge and understanding capabilities, but they have also been shown to be prone to …
Improved techniques for optimization-based jailbreaking on large language models
Large language models (LLMs) are being rapidly developed, and a key component of their
widespread deployment is their safety-related alignment. Many red-teaming efforts aim to …