Jailbreak and guard aligned language models with only few in-context demonstrations

Z Wei, Y Wang, A Li, Y Mo, Y Wang - arXiv preprint arXiv:2310.06387, 2023 - arxiv.org
Large Language Models (LLMs) have shown remarkable success in various tasks, yet their
safety and the risk of generating harmful content remain pressing concerns. In this paper, we …

Can LLM-generated misinformation be detected?

C Chen, K Shu - arXiv preprint arXiv:2309.13788, 2023 - arxiv.org
The advent of Large Language Models (LLMs) has made a transformative impact. However,
the potential that LLMs such as ChatGPT can be exploited to generate misinformation has …

WildGuard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of LLMs

S Han, K Rao, A Ettinger, L Jiang, BY Lin… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce WildGuard--an open, light-weight moderation tool for LLM safety that achieves
three goals: (1) identifying malicious intent in user prompts, (2) detecting safety risks of model …

An adversarial perspective on machine unlearning for AI safety

J Łucki, B Wei, Y Huang, P Henderson… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models are finetuned to refuse questions about hazardous knowledge, but
these protections can often be bypassed. Unlearning methods aim at completely removing …

Generative language models exhibit social identity biases

T Hu, Y Kyrychenko, S Rathje, N Collier… - Nature Computational …, 2024 - nature.com
Social identity biases, particularly the tendency to favor one's own group (ingroup solidarity)
and derogate other groups (outgroup hostility), are deeply rooted in human psychology and …

Can Editing LLMs Inject Harm?

C Chen, B Huang, Z Li, Z Chen, S Lai, X Xu… - arXiv preprint arXiv …, 2024 - arxiv.org
Knowledge editing has been increasingly adopted to correct the false or outdated
knowledge in Large Language Models (LLMs). Meanwhile, one critical but under-explored …

International Scientific Report on the Safety of Advanced AI (Interim Report)

Y Bengio, S Mindermann, D Privitera… - arXiv preprint arXiv …, 2024 - arxiv.org
This is the interim publication of the first International Scientific Report on the Safety of
Advanced AI. The report synthesises the scientific understanding of general-purpose AI--AI …

The art of saying no: Contextual noncompliance in language models

F Brahman, S Kumar, V Balachandran, P Dasigi… - arXiv preprint arXiv …, 2024 - arxiv.org
Chat-based language models are designed to be helpful, yet they should not comply with
every user request. While most existing work primarily focuses on refusal of "unsafe" …

What makes and breaks safety fine-tuning? A mechanistic study

S Jain, ES Lubana, K Oksuz, T Joy, PHS Torr… - arXiv preprint arXiv …, 2024 - arxiv.org
Safety fine-tuning helps align Large Language Models (LLMs) with human preferences for
their safe deployment. To better understand the underlying factors that make models safe via …

Safety cases for frontier AI

MD Buhl, G Sett, L Koessler, J Schuett… - arXiv preprint arXiv …, 2024 - arxiv.org
As frontier artificial intelligence (AI) systems become more capable, it becomes more
important that developers can explain why their systems are sufficiently safe. One way to do …