AI alignment: A comprehensive survey

J Ji, T Qiu, B Chen, B Zhang, H Lou, K Wang… - ar…
… generalist AI systems that can autonomously act and pursue goals. Increases in …

AI deception: A survey of examples, risks, and potential solutions

PS Park, S Goldstein, A O'Gara, M Chen, D Hendrycks - Patterns, 2024 - cell.com
This paper argues that a range of current AI systems have learned how to deceive humans.
We define deception as the systematic inducement of false beliefs in the pursuit of some …

Foundational challenges in assuring alignment and safety of large language models

U Anwar, A Saparov, J Rando, D Paleka… - arXiv preprint arXiv …, 2024 - arxiv.org
This work identifies 18 foundational challenges in assuring the alignment and safety of large
language models (LLMs). These challenges are organized into three different categories …

The alignment problem from a deep learning perspective

R Ngo, L Chan, S Mindermann - arXiv preprint arXiv:2209.00626, 2022 - arxiv.org
In coming years or decades, artificial general intelligence (AGI) may surpass human
capabilities at many critical tasks. We argue that, without substantial effort to prevent it, AGIs …

Black-box access is insufficient for rigorous AI audits

S Casper, C Ezell, C Siegmann, N Kolt… - Proceedings of the …, 2024 - dl.acm.org
External audits of AI systems are increasingly recognized as a key mechanism for AI
governance. The effectiveness of an audit, however, depends on the degree of access …

Watch out for your agents! Investigating backdoor threats to LLM-based agents

W Yang, X Bi, Y Lin, S Chen… - Advances in Neural …, 2025 - proceedings.neurips.cc
Driven by the rapid development of Large Language Models (LLMs), LLM-based agents
have been developed to handle various real-world applications, including finance …

Debating with more persuasive LLMs leads to more truthful answers

A Khan, J Hughes, D Valentine, L Ruis… - arXiv preprint arXiv …, 2024 - arxiv.org
Common methods for aligning large language models (LLMs) with desired behaviour
heavily rely on human-labelled data. However, as models grow increasingly sophisticated …