Red-Teaming for generative AI: Silver bullet or security theater?

M Feffer, A Sinha, WH Deng, ZC Lipton… - Proceedings of the AAAI …, 2024 - ojs.aaai.org
In response to rising concerns surrounding the safety, security, and trustworthiness of
Generative AI (GenAI) models, practitioners and regulators alike have pointed to AI red …

Large language model supply chain: A research agenda

S Wang, Y Zhao, X Hou, H Wang - ACM Transactions on Software …, 2024 - dl.acm.org
The rapid advancement of large language models (LLMs) has revolutionized artificial
intelligence, introducing unprecedented capabilities in natural language processing and …

Privacy in large language models: Attacks, defenses and future directions

H Li, Y Chen, J Luo, J Wang, H Peng, Y Kang… - arXiv preprint arXiv …, 2023 - arxiv.org
The advancement of large language models (LLMs) has significantly enhanced the ability to
effectively tackle various downstream NLP tasks and unify these tasks into generative …

HarmBench: A standardized evaluation framework for automated red teaming and robust refusal

M Mazeika, L Phan, X Yin, A Zou, Z Wang, N Mu… - arXiv preprint arXiv …, 2024 - arxiv.org
Automated red teaming holds substantial promise for uncovering and mitigating the risks
associated with the malicious use of large language models (LLMs), yet the field lacks a …

Jailbreak attacks and defenses against large language models: A survey

S Yi, Y Liu, Z Sun, T Cong, X He, J Song, K Xu… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) have performed exceptionally in various text-generative
tasks, including question answering, translation, code completion, etc. However, the over …

Open-Ethical AI: Advancements in Open-Source Human-Centric Neural Language Models

S Sicari, JF Cevallos M, A Rizzardi… - ACM Computing …, 2024 - dl.acm.org
This survey summarises the most recent methods for building and assessing helpful, honest,
and harmless neural language models, considering small, medium, and large-size models …

LLM defenses are not robust to multi-turn human jailbreaks yet

N Li, Z Han, I Steneker, W Primack, R Goodside… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent large language model (LLM) defenses have greatly improved models' ability to
refuse harmful queries, even when adversarially attacked. However, LLM defenses are …

JailbreakZoo: Survey, landscapes, and horizons in jailbreaking large language and vision-language models

H Jin, L Hu, X Li, P Zhang, C Chen, J Zhuang… - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid evolution of artificial intelligence (AI) through developments in Large Language
Models (LLMs) and Vision-Language Models (VLMs) has brought significant advancements …

Rainbow teaming: Open-ended generation of diverse adversarial prompts

M Samvelyan, SC Raparthy, A Lupu, E Hambro… - arXiv preprint arXiv …, 2024 - arxiv.org
As large language models (LLMs) become increasingly prevalent across many real-world
applications, understanding and enhancing their robustness to user inputs is of paramount …

Self-supervised visual preference alignment

K Zhu, L Zhao, Z Ge, X Zhang - Proceedings of the 32nd ACM …, 2024 - dl.acm.org
This paper makes the first attempt towards unsupervised preference alignment in Vision-
Language Models (VLMs). We generate chosen and rejected responses with regard to the …