Red-Teaming for Generative AI: Silver Bullet or Security Theater?

M Feffer, A Sinha, WH Deng, ZC Lipton… - Proceedings of the AAAI …, 2024 - ojs.aaai.org
In response to rising concerns surrounding the safety, security, and trustworthiness of
Generative AI (GenAI) models, practitioners and regulators alike have pointed to AI red …

Open-Ethical AI: Advancements in Open-Source Human-Centric Neural Language Models

S Sicari, JF Cevallos M, A Rizzardi… - ACM Computing …, 2024 - dl.acm.org
This survey summarises the most recent methods for building and assessing helpful, honest,
and harmless neural language models, considering small, medium, and large-size models …

Against The Achilles' Heel: A Survey on Red Teaming for Generative Models

L Lin, H Mu, Z Zhai, M Wang, Y Wang, R Wang… - Journal of Artificial …, 2025 - jair.org
Generative models are rapidly gaining popularity and being integrated into everyday
applications, raising concerns over their safe use as various vulnerabilities are exposed. In …

Defending jailbreak prompts via in-context adversarial game

Y Zhou, Y Han, H Zhuang, K Guo, Z Liang… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) demonstrate remarkable capabilities across diverse
applications. However, concerns regarding their security, particularly the vulnerability to …

The ethical security of large language models: A systematic review

F Liu, J Jiang, Y Lu, Z Huang, J Jiang - Frontiers of Engineering …, 2025 - Springer
The widespread application of large language models (LLMs) has highlighted new security
challenges and ethical concerns, attracting significant academic and societal attention …

Summon a demon and bind it: A grounded theory of LLM red teaming in the wild

N Inie, J Stray, L Derczynski - arXiv preprint arXiv:2311.06237, 2023 - arxiv.org
Engaging in the deliberate generation of abnormal outputs from large language models
(LLMs) by attacking them is a novel human activity. This paper presents a thorough …

Policy Space Response Oracles: A Survey

A Bighashdel, Y Wang, S McAleer, R Savani… - arXiv preprint arXiv …, 2024 - arxiv.org
In game theory, a game refers to a model of interaction among rational decision-makers or
players, making choices with the goal of achieving their individual objectives. Understanding …

From Natural Language to Extensive-Form Game Representations

S Deng, Y Wang, R Savani - arXiv preprint arXiv:2501.17282, 2025 - arxiv.org
We introduce a framework for translating game descriptions in natural language into
extensive-form representations in game theory, leveraging Large Language Models (LLMs) …

Towards Scalable Automated Alignment of LLMs: A Survey

B Cao, K Lu, X Lu, J Chen, M Ren, H Xiang… - arXiv preprint arXiv …, 2024 - arxiv.org
Alignment is the most critical step in building large language models (LLMs) that meet
human needs. With the rapid development of LLMs gradually surpassing human …

Verbalized Bayesian Persuasion

W Li, Y Lin, X Wang, B Jin, H Zha, B Wang - arXiv preprint arXiv …, 2025 - arxiv.org
Information design (ID) explores how a sender influences the optimal behavior of receivers to
achieve specific objectives. While ID originates from everyday human communication …