Jailbreak and guard aligned language models with only few in-context demonstrations
Large Language Models (LLMs) have shown remarkable success in various tasks, yet their
safety and the risk of generating harmful content remain pressing concerns. In this paper, we …
Can LLM-generated misinformation be detected?
The advent of Large Language Models (LLMs) has made a transformative impact. However,
the potential that LLMs such as ChatGPT can be exploited to generate misinformation has …
WildGuard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of LLMs
We introduce WildGuard--an open, light-weight moderation tool for LLM safety that achieves
three goals: (1) identifying malicious intent in user prompts, (2) detecting safety risks of model …
An adversarial perspective on machine unlearning for AI safety
Large language models are finetuned to refuse questions about hazardous knowledge, but
these protections can often be bypassed. Unlearning methods aim at completely removing …
Generative language models exhibit social identity biases
Social identity biases, particularly the tendency to favor one's own group (ingroup solidarity)
and derogate other groups (outgroup hostility), are deeply rooted in human psychology and …
Can Editing LLMs Inject Harm?
Knowledge editing has been increasingly adopted to correct the false or outdated
knowledge in Large Language Models (LLMs). Meanwhile, one critical but under-explored …
International Scientific Report on the Safety of Advanced AI (Interim Report)
This is the interim publication of the first International Scientific Report on the Safety of
Advanced AI. The report synthesises the scientific understanding of general-purpose AI--AI …
The art of saying no: Contextual noncompliance in language models
Chat-based language models are designed to be helpful, yet they should not comply with
every user request. While most existing work primarily focuses on refusal of "unsafe" …
What makes and breaks safety fine-tuning? A mechanistic study
Safety fine-tuning helps align Large Language Models (LLMs) with human preferences for
their safe deployment. To better understand the underlying factors that make models safe via …
Safety cases for frontier AI
As frontier artificial intelligence (AI) systems become more capable, it becomes more
important that developers can explain why their systems are sufficiently safe. One way to do …