Refusal in language models is mediated by a single direction

A Arditi, O Obeso, A Syed, D Paleka… - arxiv preprint arxiv …, 2024 - arxiv.org
Conversational large language models are fine-tuned for both instruction-following and
safety, resulting in models that obey benign requests but refuse harmful ones. While this …

Harmbench: A standardized evaluation framework for automated red teaming and robust refusal

M Mazeika, L Phan, X Yin, A Zou, Z Wang, N Mu… - arxiv preprint arxiv …, 2024 - arxiv.org
Automated red teaming holds substantial promise for uncovering and mitigating the risks
associated with the malicious use of large language models (LLMs), yet the field lacks a …

Sleeper agents: Training deceptive llms that persist through safety training

E Hubinger, C Denison, J Mu, M Lambert… - arxiv preprint arxiv …, 2024 - arxiv.org
Humans are capable of strategically deceptive behavior: behaving helpfully in most
situations, but then behaving very differently in order to pursue alternative objectives when …

Does Refusal Training in LLMs Generalize to the Past Tense?

M Andriushchenko, N Flammarion - arxiv preprint arxiv:2407.11969, 2024 - arxiv.org
Refusal training is widely used to prevent LLMs from generating harmful, undesirable, or
illegal outputs. We reveal a curious generalization gap in the current refusal training …

Jailbreakbench: An open robustness benchmark for jailbreaking large language models

P Chao, E Debenedetti, A Robey… - arxiv preprint arxiv …, 2024 - arxiv.org
Jailbreak attacks cause large language models (LLMs) to generate harmful, unethical, or
otherwise objectionable content. Evaluating these attacks presents a number of challenges …

Rapid optimization for jailbreaking llms via subconscious exploitation and echopraxia

G Shen, S Cheng, K Zhang, G Tao, S An, L Yan… - arxiv preprint arxiv …, 2024 - arxiv.org
Large Language Models (LLMs) have become prevalent across diverse sectors,
transforming human life with their extraordinary reasoning and comprehension abilities. As …

Mitigating backdoor threats to large language models: Advancement and challenges

Q Liu, W Mo, T Tong, J Xu, F Wang… - 2024 60th Annual …, 2024 - ieeexplore.ieee.org
The advancement of Large Language Models (LLMs) has significantly impacted various
domains, including Web search, healthcare, and software development. However, as these …

Tracrbench: Generating interpretability testbeds with large language models

H Thurnherr, J Scheurer - arxiv preprint arxiv:2409.13714, 2024 - arxiv.org
Achieving a mechanistic understanding of transformer-based language models is an open
challenge, especially due to their large number of parameters. Moreover, the lack of ground …

Mrj-agent: An effective jailbreak agent for multi-round dialogue

F Wang, R Duan, P **ao, X Jia, YF Chen… - arxiv preprint arxiv …, 2024 - arxiv.org
Large Language Models (LLMs) demonstrate outstanding performance in their reservoir of
knowledge and understanding capabilities, but they have also been shown to be prone to …

Improved techniques for optimization-based jailbreaking on large language models

X Jia, T Pang, C Du, Y Huang, J Gu, Y Liu… - arxiv preprint arxiv …, 2024 - arxiv.org
Large language models (LLMs) are being rapidly developed, and a key component of their
widespread deployment is their safety-related alignment. Many red-teaming efforts aim to …