AI alignment: A comprehensive survey
AI alignment aims to make AI systems behave in line with human intentions and values. As
AI systems grow more capable, the potential large-scale risks associated with misaligned AI …
Human-robot teaming: Grand challenges
Purpose of Review: Current real-world interaction between humans and robots is
extremely limited. We present challenges that, if addressed, will enable humans and robots …
Do the rewards justify the means? Measuring trade-offs between rewards and ethical behavior in the MACHIAVELLI benchmark
Artificial agents have traditionally been trained to maximize reward, which may incentivize
power-seeking and deception, analogous to how next-token prediction in language models …
Vision-language models as success detectors
Detecting successful behaviour is crucial for training intelligent agents. As such,
generalisable reward models are a prerequisite for agents that can learn to generalise their …
A survey of reinforcement learning from human feedback
Reinforcement learning from human feedback (RLHF) is a variant of reinforcement learning
(RL) that learns from human feedback instead of relying on an engineered reward function …
B-Pref: Benchmarking preference-based reinforcement learning
Reinforcement learning (RL) requires access to a reward function that incentivizes the right
behavior, but these are notoriously hard to specify for complex tasks. Preference-based RL …
WARM: On the benefits of weight averaged reward models
Aligning large language models (LLMs) with human preferences through reinforcement
learning (RLHF) can lead to reward hacking, where LLMs exploit failures in the reward …
Benchmarks and algorithms for offline preference-based reward learning
Learning a reward function from human preferences is challenging as it typically requires
having a high-fidelity simulator or using expensive and potentially unsafe actual physical …
What would Jiminy Cricket do? Towards agents that behave morally
When making everyday decisions, people are guided by their conscience, an internal sense
of right and wrong. By contrast, artificial agents are currently not endowed with a moral …
Aligning robot and human representations
To act in the world, robots rely on a representation of salient task aspects: for example, to
carry a coffee mug, a robot may consider movement efficiency or mug orientation in its …