Scalable agent alignment via reward modeling: a research direction

J Leike, D Krueger, T Everitt, M Martic, V Maini… - arXiv preprint arXiv …, 2018 - arxiv.org
One obstacle to applying reinforcement learning algorithms to real-world problems is the
lack of suitable reward functions. Designing such reward functions is difficult in part because …

Self-control in cyberspace: Applying dual systems theory to a review of digital self-control tools

U Lyngs, K Lukoff, P Slovak, R Binns, A Slack… - Proceedings of the …, 2019 - dl.acm.org
Many people struggle to control their use of digital devices. However, our understanding of
the design mechanisms that support user self-control remains limited. In this paper, we make …

Off-policy deep reinforcement learning without exploration

S Fujimoto, D Meger, D Precup - … conference on machine …, 2019 - proceedings.mlr.press
Many practical applications of reinforcement learning constrain agents to learn from a fixed
batch of data which has already been gathered, without offering further possibility for data …

Machine theory of mind

N Rabinowitz, F Perbet, F Song… - International …, 2018 - proceedings.mlr.press
Theory of mind (ToM) broadly refers to humans' ability to represent the mental
states of others, including their desires, beliefs, and intentions. We design a Theory of Mind …

Concrete problems in AI safety

D Amodei, C Olah, J Steinhardt, P Christiano… - arXiv preprint arXiv …, 2016 - arxiv.org
Rapid progress in machine learning and artificial intelligence (AI) has brought increasing
attention to the potential impacts of AI technologies on society. In this paper we discuss one …

Inverse reward design

D Hadfield-Menell, S Milli, P Abbeel… - Advances in neural …, 2017 - proceedings.neurips.cc
Autonomous agents optimize the reward function we give them. What they don't know is how
hard it is for us to design a reward function that actually captures what we want. When …

Emotion prediction as computation over a generative theory of mind

SD Houlihan, M Kleiman-Weiner… - … of the Royal …, 2023 - royalsocietypublishing.org
From sparse descriptions of events, observers can make systematic and nuanced
predictions of what emotions the people involved will experience. We propose a formal …

Online Bayesian goal inference for boundedly rational planning agents

T Zhi-Xuan, J Mann, T Silver… - Advances in neural …, 2020 - proceedings.neurips.cc
People routinely infer the goals of others by observing their actions over time. Remarkably,
we can do so even when those actions lead to failure, enabling us to assist others when we …

AGI safety literature review

T Everitt, G Lea, M Hutter - arXiv preprint arXiv:1805.01109, 2018 - arxiv.org
The development of Artificial General Intelligence (AGI) promises to be a major event. Along
with its many potential benefits, it also raises serious safety concerns (Bostrom, 2014). The …

When humans aren't optimal: Robots that collaborate with risk-aware humans

M Kwon, E Biyik, A Talati, K Bhasin, DP Losey… - Proceedings of the …, 2020 - dl.acm.org
In order to collaborate safely and efficiently, robots need to anticipate how their human
partners will behave. Some of today's robots model humans as if they were also robots, and …