The Llama 3 herd of models

A Dubey, A Jauhri, A Pandey, A Kadian… - arXiv preprint arXiv …, 2024 - arxiv.org
Modern artificial intelligence (AI) systems are powered by foundation models. This paper
presents a new set of foundation models, called Llama 3. It is a herd of language models …

GPTFuzzer: Red teaming large language models with auto-generated jailbreak prompts

J Yu, X Lin, Z Yu, X Xing - arXiv preprint arXiv:2309.10253, 2023 - arxiv.org
Large language models (LLMs) have recently experienced tremendous popularity and are
widely used from casual conversations to AI-driven programming. However, despite their …

Fine-tuning aligned language models compromises safety, even when users do not intend to!

X Qi, Y Zeng, T Xie, PY Chen, R Jia, P Mittal… - arXiv preprint arXiv …, 2023 - arxiv.org
Optimizing large language models (LLMs) for downstream use cases often involves the
customization of pre-trained LLMs through further fine-tuning. Meta's open release of Llama …

LMSYS-Chat-1M: A large-scale real-world LLM conversation dataset

L Zheng, WL Chiang, Y Sheng, T Li, S Zhuang… - arXiv preprint arXiv …, 2023 - arxiv.org
Studying how people interact with large language models (LLMs) in real-world scenarios is
increasingly important due to their widespread use in various applications. In this paper, we …

Red-Teaming for generative AI: Silver bullet or security theater?

M Feffer, A Sinha, WH Deng, ZC Lipton… - Proceedings of the AAAI …, 2024 - ojs.aaai.org
In response to rising concerns surrounding the safety, security, and trustworthiness of
Generative AI (GenAI) models, practitioners and regulators alike have pointed to AI red …

Defending against alignment-breaking attacks via robustly aligned LLM

B Cao, Y Cao, L Lin, J Chen - arXiv preprint arXiv:2309.14348, 2023 - arxiv.org
Recently, Large Language Models (LLMs) have made significant advancements and are
now widely used across various domains. Unfortunately, there has been a rising concern …

Introducing v0.5 of the AI Safety Benchmark from MLCommons

B Vidgen, A Agrawal, AM Ahmed, V Akinwande… - arXiv preprint arXiv …, 2024 - arxiv.org
This paper introduces v0.5 of the AI Safety Benchmark, which has been created by the
MLCommons AI Safety Working Group. The AI Safety Benchmark has been designed to …

Improving alignment and robustness with circuit breakers

A Zou, L Phan, J Wang, D Duenas, M Lin… - The Thirty-eighth …, 2024 - openreview.net
AI systems can take harmful actions and are highly vulnerable to adversarial attacks. We
present an approach, inspired by recent advances in representation engineering, that …

SORRY-Bench: Systematically evaluating large language model safety refusal behaviors

T Xie, X Qi, Y Zeng, Y Huang, UM Sehwag… - arXiv preprint arXiv …, 2024 - arxiv.org
Evaluating aligned large language models' (LLMs) ability to recognize and reject unsafe user
requests is crucial for safe, policy-compliant deployments. Existing evaluation efforts …

ChatGPT's one-year anniversary: are open-source large language models catching up?

H Chen, F Jiao, X Li, C Qin, M Ravaut, R Zhao… - arXiv preprint arXiv …, 2023 - arxiv.org
Since its release in late 2022, ChatGPT has brought a seismic shift in the entire landscape of
AI, both in research and commerce. Through instruction-tuning a large language model …