Scalable agent alignment via reward modeling: a research direction
One obstacle to applying reinforcement learning algorithms to real-world problems is the
lack of suitable reward functions. Designing such reward functions is difficult in part because …
Self-control in cyberspace: Applying dual systems theory to a review of digital self-control tools
Many people struggle to control their use of digital devices. However, our understanding of
the design mechanisms that support user self-control remains limited. In this paper, we make …
Off-policy deep reinforcement learning without exploration
Many practical applications of reinforcement learning constrain agents to learn from a fixed
batch of data which has already been gathered, without offering further possibility for data …
Machine theory of mind
Theory of mind (ToM) broadly refers to humans' ability to represent the mental
states of others, including their desires, beliefs, and intentions. We design a Theory of Mind …
Concrete problems in AI safety
Rapid progress in machine learning and artificial intelligence (AI) has brought increasing
attention to the potential impacts of AI technologies on society. In this paper we discuss one …
Inverse reward design
Autonomous agents optimize the reward function we give them. What they don't know is how
hard it is for us to design a reward function that actually captures what we want. When …
Emotion prediction as computation over a generative theory of mind
From sparse descriptions of events, observers can make systematic and nuanced
predictions of what emotions the people involved will experience. We propose a formal …
Online Bayesian goal inference for boundedly rational planning agents
People routinely infer the goals of others by observing their actions over time. Remarkably,
we can do so even when those actions lead to failure, enabling us to assist others when we …
AGI safety literature review
The development of Artificial General Intelligence (AGI) promises to be a major event. Along
with its many potential benefits, it also raises serious safety concerns (Bostrom, 2014). The …
When humans aren't optimal: Robots that collaborate with risk-aware humans
In order to collaborate safely and efficiently, robots need to anticipate how their human
partners will behave. Some of today's robots model humans as if they were also robots, and …