AI alignment: A comprehensive survey

J Ji, T Qiu, B Chen, B Zhang, H Lou, K Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
AI alignment aims to make AI systems behave in line with human intentions and values. As
AI systems grow more capable, the potential large-scale risks associated with misaligned AI …

Human-robot teaming: grand challenges

M Natarajan, E Seraj, B Altundas, R Paleja, S Ye… - Current Robotics …, 2023 - Springer
Purpose of Review: Current real-world interaction between humans and robots is
extremely limited. We present challenges that, if addressed, will enable humans and robots …

Do the rewards justify the means? Measuring trade-offs between rewards and ethical behavior in the MACHIAVELLI benchmark

A Pan, JS Chan, A Zou, N Li, S Basart… - International …, 2023 - proceedings.mlr.press
Artificial agents have traditionally been trained to maximize reward, which may incentivize
power-seeking and deception, analogous to how next-token prediction in language models …

Vision-language models as success detectors

Y Du, K Konyushkova, M Denil, A Raju… - arXiv preprint arXiv …, 2023 - arxiv.org
Detecting successful behaviour is crucial for training intelligent agents. As such,
generalisable reward models are a prerequisite for agents that can learn to generalise their …

A survey of reinforcement learning from human feedback

T Kaufmann, P Weng, V Bengs… - arXiv preprint arXiv …, 2023 - arxiv.org
Reinforcement learning from human feedback (RLHF) is a variant of reinforcement learning
(RL) that learns from human feedback instead of relying on an engineered reward function …
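
As a minimal illustration of the reward-modelling step that RLHF surveys of this kind cover, the sketch below fits a scalar reward model on pairwise preference data with the Bradley-Terry log-likelihood loss commonly used in RLHF pipelines. The network architecture, feature dimension, and random toy data are illustrative assumptions, not the survey's own implementation.

# Minimal sketch (assumptions, not the survey's code): reward-model training
# on pairwise preferences with the Bradley-Terry loss used in RLHF pipelines.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, feature_dim: int = 128):
        super().__init__()
        # Maps a pre-computed trajectory/response embedding to a scalar reward.
        self.net = nn.Sequential(
            nn.Linear(feature_dim, 64), nn.ReLU(), nn.Linear(64, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

def bradley_terry_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # P(chosen preferred over rejected) = sigmoid(r_chosen - r_rejected);
    # minimize the negative log-likelihood of the human labels.
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()

# Toy training step on random embeddings standing in for a preference batch.
model = RewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
chosen, rejected = torch.randn(32, 128), torch.randn(32, 128)
loss = bradley_terry_loss(model(chosen), model(rejected))
opt.zero_grad()
loss.backward()
opt.step()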

B-Pref: Benchmarking preference-based reinforcement learning

K Lee, L Smith, A Dragan, P Abbeel - arXiv preprint arXiv:2111.03026, 2021 - arxiv.org
Reinforcement learning (RL) requires access to a reward function that incentivizes the right
behavior, but these are notoriously hard to specify for complex tasks. Preference-based RL …

WARM: On the benefits of weight averaged reward models

A Ramé, N Vieillard, L Hussenot, R Dadashi… - arXiv preprint arXiv …, 2024 - arxiv.org
Aligning large language models (LLMs) with human preferences through reinforcement
learning (RLHF) can lead to reward hacking, where LLMs exploit failures in the reward …
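
A rough sketch of the weight-averaging idea the abstract refers to: combine several reward models of identical architecture by averaging their parameters element-wise. The helper below is a generic illustration under that assumption, not WARM's actual procedure or hyperparameters.

# Rough sketch (assumption-based, not WARM's code): average the parameters of
# several identically shaped reward models into a single merged model.
import copy
import torch
import torch.nn as nn

def average_weights(models: list[nn.Module]) -> nn.Module:
    # Copy the first model, then overwrite its parameters with the
    # element-wise mean over all models' state dicts.
    averaged = copy.deepcopy(models[0])
    avg_state = averaged.state_dict()
    for key in avg_state:
        avg_state[key] = torch.stack(
            [m.state_dict()[key].float() for m in models], dim=0
        ).mean(dim=0)
    averaged.load_state_dict(avg_state)
    return averaged

# Toy usage: three reward heads fine-tuned from the same initialization.
heads = [nn.Linear(128, 1) for _ in range(3)]
merged = average_weights(heads)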

Benchmarks and algorithms for offline preference-based reward learning

D Shin, AD Dragan, DS Brown - arXiv preprint arXiv:2301.01392, 2023 - arxiv.org
Learning a reward function from human preferences is challenging as it typically requires
having a high-fidelity simulator or using expensive and potentially unsafe actual physical …

What would Jiminy Cricket do? Towards agents that behave morally

D Hendrycks, M Mazeika, A Zou, S Patel, C Zhu… - arXiv preprint arXiv …, 2021 - arxiv.org
When making everyday decisions, people are guided by their conscience, an internal sense
of right and wrong. By contrast, artificial agents are currently not endowed with a moral …

Aligning robot and human representations

A Bobu, A Peng, P Agrawal, J Shah… - arXiv preprint arXiv …, 2023 - arxiv.org
To act in the world, robots rely on a representation of salient task aspects: for example, to
carry a coffee mug, a robot may consider movement efficiency or mug orientation in its …