From generation to judgment: Opportunities and challenges of LLM-as-a-judge
Assessment and evaluation have long been critical challenges in artificial intelligence (AI)
and natural language processing (NLP). However, traditional methods, whether matching …
MM-SafetyBench: A benchmark for safety evaluation of multimodal large language models
The security concerns surrounding Large Language Models (LLMs) have been extensively
explored, yet the safety of Multimodal Large Language Models (MLLMs) remains …
The unlocking spell on base LLMs: Rethinking alignment via in-context learning
Alignment tuning has become the de facto standard practice for enabling base large
language models (LLMs) to serve as open-domain AI assistants. The alignment tuning …
How Johnny can persuade LLMs to jailbreak them: Rethinking persuasion to challenge AI safety by humanizing LLMs
Most traditional AI safety research has approached AI models as machines and centered on
algorithm-focused attacks developed by security experts. As large language models (LLMs) …
Defending large language models against jailbreaking attacks through goal prioritization
While significant attention has been dedicated to exploiting weaknesses in LLMs through
jailbreaking attacks, there remains a paucity of effort in defending against these attacks. We …
LLM Self Defense: By self examination, LLMs know they are being tricked
Large language models (LLMs) are popular for high-quality text generation but can produce
harmful content, even when aligned with human values through reinforcement learning …
Low-resource languages jailbreak GPT-4
AI safety training and red-teaming of large language models (LLMs) are measures to
mitigate the generation of unsafe content. Our work exposes the inherent cross-lingual …
Shadow alignment: The ease of subverting safely-aligned language models
Warning: This paper contains examples of harmful language, and reader discretion is
recommended. The increasing open release of powerful large language models (LLMs) has …
HarmBench: A standardized evaluation framework for automated red teaming and robust refusal
Automated red teaming holds substantial promise for uncovering and mitigating the risks
associated with the malicious use of large language models (LLMs), yet the field lacks a …
Jailbreak attacks and defenses against large language models: A survey
Large Language Models (LLMs) have performed exceptionally in various text-generative
tasks, including question answering, translation, and code completion. However, the over …