[HTML][HTML] AI deception: A survey of examples, risks, and potential solutions
This paper argues that a range of current AI systems have learned how to deceive humans.
We define deception as the systematic inducement of false beliefs in the pursuit of some …
We define deception as the systematic inducement of false beliefs in the pursuit of some …
On scientific understanding with artificial intelligence
An oracle that correctly predicts the outcome of every particle physics experiment, the
products of every possible chemical reaction or the function of every protein would …
products of every possible chemical reaction or the function of every protein would …
Scaling laws for reward model overoptimization
In reinforcement learning from human feedback, it is common to optimize against a reward
model trained to predict human preferences. Because the reward model is an imperfect …
model trained to predict human preferences. Because the reward model is an imperfect …
Guiding pretraining in reinforcement learning with large language models
Reinforcement learning algorithms typically struggle in the absence of a dense, well-shaped
reward function. Intrinsically motivated exploration methods address this limitation by …
reward function. Intrinsically motivated exploration methods address this limitation by …
Foundational challenges in assuring alignment and safety of large language models
This work identifies 18 foundational challenges in assuring the alignment and safety of large
language models (LLMs). These challenges are organized into three different categories …
language models (LLMs). These challenges are organized into three different categories …
Unsolved problems in ml safety
Machine learning (ML) systems are rapidly increasing in size, are acquiring new
capabilities, and are increasingly deployed in high-stakes settings. As with other powerful …
capabilities, and are increasingly deployed in high-stakes settings. As with other powerful …
Shortcut learning in deep neural networks
Deep learning has triggered the current rise of artificial intelligence and is the workhorse of
today's machine intelligence. Numerous success stories have rapidly spread all over …
today's machine intelligence. Numerous success stories have rapidly spread all over …
Ai alignment: A comprehensive survey
AI alignment aims to make AI systems behave in line with human intentions and values. As
AI systems grow more capable, the potential large-scale risks associated with misaligned AI …
AI systems grow more capable, the potential large-scale risks associated with misaligned AI …
The alignment problem from a deep learning perspective
In coming decades, artificial general intelligence (AGI) may surpass human capabilities at
many critical tasks. We argue that, without substantial effort to prevent it, AGIs could learn to …
many critical tasks. We argue that, without substantial effort to prevent it, AGIs could learn to …
Emergent tool use from multi-agent autocurricula
Through multi-agent competition, the simple objective of hide-and-seek, and standard
reinforcement learning algorithms at scale, we find that agents create a self-supervised …
reinforcement learning algorithms at scale, we find that agents create a self-supervised …