From generation to judgment: Opportunities and challenges of LLM-as-a-Judge

D Li, B Jiang, L Huang, A Beigi, C Zhao, Z Tan… - arXiv preprint arXiv …, 2024 - arxiv.org
Assessment and evaluation have long been critical challenges in artificial intelligence (AI)
and natural language processing (NLP). However, traditional methods, whether matching …

MM-SafetyBench: A benchmark for safety evaluation of multimodal large language models

X Liu, Y Zhu, J Gu, Y Lan, C Yang, Y Qiao - European Conference on …, 2024 - Springer
The security concerns surrounding Large Language Models (LLMs) have been extensively
explored, yet the safety of Multimodal Large Language Models (MLLMs) remains …

The unlocking spell on base LLMs: Rethinking alignment via in-context learning

BY Lin, A Ravichander, X Lu, N Dziri… - The Twelfth …, 2023 - openreview.net
Alignment tuning has become the de facto standard practice for enabling base large
language models (LLMs) to serve as open-domain AI assistants. The alignment tuning …

How Johnny can persuade LLMs to jailbreak them: Rethinking persuasion to challenge AI safety by humanizing LLMs

Y Zeng, H Lin, J Zhang, D Yang, R Jia… - arXiv preprint arXiv …, 2024 - arxiv.org
Most traditional AI safety research has approached AI models as machines and centered on
algorithm-focused attacks developed by security experts. As large language models (LLMs) …

Defending large language models against jailbreaking attacks through goal prioritization

Z Zhang, J Yang, P Ke, F Mi, H Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
While significant attention has been dedicated to exploiting weaknesses in LLMs through
jailbreaking attacks, there remains a paucity of effort in defending against these attacks. We …

LLM Self Defense: By self examination, LLMs know they are being tricked

M Phute, A Helbling, M Hull, SY Peng, S Szyller… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language models (LLMs) are popular for high-quality text generation but can produce
harmful content, even when aligned with human values through reinforcement learning …

Low-resource languages jailbreak GPT-4

ZX Yong, C Menghini, SH Bach - arXiv preprint arXiv:2310.02446, 2023 - arxiv.org
AI safety training and red-teaming of large language models (LLMs) are measures to
mitigate the generation of unsafe content. Our work exposes the inherent cross-lingual …

Shadow alignment: The ease of subverting safely-aligned language models

X Yang, X Wang, Q Zhang, L Petzold, WY Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
Warning: This paper contains examples of harmful language, and reader discretion is
recommended. The increasing open release of powerful large language models (LLMs) has …

HarmBench: A standardized evaluation framework for automated red teaming and robust refusal

M Mazeika, L Phan, X Yin, A Zou, Z Wang, N Mu… - arXiv preprint arXiv …, 2024 - arxiv.org
Automated red teaming holds substantial promise for uncovering and mitigating the risks
associated with the malicious use of large language models (LLMs), yet the field lacks a …

Jailbreak attacks and defenses against large language models: A survey

S Yi, Y Liu, Z Sun, T Cong, X He, J Song, K Xu… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) have performed exceptionally well in various text-generation
tasks, including question answering, translation, code completion, etc. However, the over …