Refusal in language models is mediated by a single direction
Conversational large language models are fine-tuned for both instruction-following and
safety, resulting in models that obey benign requests but refuse harmful ones. While this …
HarmBench: A standardized evaluation framework for automated red teaming and robust refusal
Automated red teaming holds substantial promise for uncovering and mitigating the risks
associated with the malicious use of large language models (LLMs), yet the field lacks a …
Sleeper agents: Training deceptive LLMs that persist through safety training
Humans are capable of strategically deceptive behavior: behaving helpfully in most
situations, but then behaving very differently in order to pursue alternative objectives when …
Does Refusal Training in LLMs Generalize to the Past Tense?
M Andriushchenko, N Flammarion - arXiv preprint arXiv:2407.11969, 2024 - arxiv.org
Refusal training is widely used to prevent LLMs from generating harmful, undesirable, or
illegal outputs. We reveal a curious generalization gap in the current refusal training …
JailbreakBench: An open robustness benchmark for jailbreaking large language models
Jailbreak attacks cause large language models (LLMs) to generate harmful, unethical, or
otherwise objectionable content. Evaluating these attacks presents a number of challenges …
Rapid optimization for jailbreaking LLMs via subconscious exploitation and echopraxia
Large Language Models (LLMs) have become prevalent across diverse sectors,
transforming human life with their extraordinary reasoning and comprehension abilities. As …
Mitigating backdoor threats to large language models: Advancement and challenges
The advancement of Large Language Models (LLMs) has significantly impacted various
domains, including Web search, healthcare, and software development. However, as these …
TracrBench: Generating interpretability testbeds with large language models
H Thurnherr, J Scheurer - arXiv preprint arXiv:2409.13714, 2024 - arxiv.org
Achieving a mechanistic understanding of transformer-based language models is an open
challenge, especially due to their large number of parameters. Moreover, the lack of ground …
MRJ-Agent: An effective jailbreak agent for multi-round dialogue
F Wang, R Duan, P Xiao, X Jia, YF Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) demonstrate outstanding performance in their reservoir of
knowledge and understanding capabilities, but they have also been shown to be prone to …
Improved techniques for optimization-based jailbreaking on large language models
Large language models (LLMs) are being rapidly developed, and a key component of their
widespread deployment is their safety-related alignment. Many red-teaming efforts aim to …