Open problems and fundamental limitations of reinforcement learning from human feedback

S Casper, X Davies, C Shi, TK Gilbert… - arXiv preprint arXiv …, 2023 - arxiv.org
Reinforcement learning from human feedback (RLHF) is a technique for training AI systems
to align with human goals. RLHF has emerged as the central method used to finetune state …

A survey of reinforcement learning from human feedback

T Kaufmann, P Weng, V Bengs… - arXiv preprint arXiv …, 2023 - researchgate.net
Reinforcement learning from human feedback (RLHF) is a variant of reinforcement learning
(RL) that learns from human feedback instead of relying on an engineered reward function …

Causal confusion and reward misidentification in preference-based reward learning

J Tien, JZY He, Z Erickson, AD Dragan… - arXiv preprint arXiv …, 2022 - arxiv.org
Learning policies via preference-based reward learning is an increasingly popular method
for customizing agent behavior, but has been shown anecdotally to be prone to spurious …

A taxonomy for similarity metrics between Markov decision processes

J García, Á Visús, F Fernández - Machine Learning, 2022 - Springer
Although the notion of task similarity is potentially interesting in a wide range of areas such
as curriculum learning or automated planning, it has mostly been tied to transfer learning …

STARC: A general framework for quantifying differences between reward functions

J Skalse, L Farnik, SR Motwani, E Jenner… - arXiv preprint arXiv …, 2023 - arxiv.org
In order to solve a task using reinforcement learning, it is necessary to first formalise the goal
of that task as a reward function. However, for many real-world tasks, it is very difficult to …

MetaRM: Shifted distributions alignment via meta-learning

S Dou, Y Liu, E Zhou, T Li, H Jia, L Xiong… - arXiv preprint arXiv …, 2024 - arxiv.org
The success of Reinforcement Learning from Human Feedback (RLHF) in language model
alignment is critically dependent on the capability of the reward model (RM). However, as …

Quantifying the sensitivity of inverse reinforcement learning to misspecification

J Skalse, A Abate - arXiv preprint arXiv:2403.06854, 2024 - arxiv.org
Inverse reinforcement learning (IRL) aims to infer an agent's preferences (represented as a
reward function $R$) from their behaviour (represented as a policy $\pi$). To do this, we …

A generalized acquisition function for preference-based reward learning

E Ellis, GR Ghosal, SJ Russell… - … on Robotics and …, 2024 - ieeexplore.ieee.org
Preference-based reward learning is a popular technique for teaching robots and
autonomous systems how a human user wants them to perform a task. Previous works have …

A general framework for reward function distances

E Jenner, JMV Skalse, A Gleave - NeurIPS ML Safety Workshop, 2022 - openreview.net
In reward learning, it is helpful to be able to measure distances between reward functions,
for example to evaluate learned reward models. Using simple metrics such as $L^2$ …
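To make the "simple metric" referenced in this snippet concrete, the sketch below computes an $L^2$ distance between two tabular reward functions treated as vectors over (state, action) pairs. The table sizes and reward values are hypothetical, and this illustrates only the naive baseline metric the snippet names, not the framework proposed in the paper.

```python
import numpy as np

# Minimal sketch: L^2 distance between two tabular reward functions,
# viewed as vectors over (state, action) pairs. Shapes and values are
# hypothetical, purely for illustration.
n_states, n_actions = 4, 2
rng = np.random.default_rng(0)

r1 = rng.normal(size=(n_states, n_actions))  # first reward function
r2 = r1 + 0.5                                # second: r1 shifted by a constant

# Flatten and take the Euclidean norm to get the L^2 distance.
l2_distance = np.linalg.norm((r1 - r2).ravel())
print(l2_distance)  # nonzero, even though a constant shift leaves the set of optimal policies unchanged
```

The constant-shift example hints at why such a metric can be uninformative for evaluating learned reward models: rewards that induce identical optimal behaviour can still be far apart in $L^2$.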

Partial Identifiability and Misspecification in Inverse Reinforcement Learning

J Skalse, A Abate - arXiv preprint arXiv:2411.15951, 2024 - arxiv.org
The aim of Inverse Reinforcement Learning (IRL) is to infer a reward function $R$ from a
policy $\pi$. This problem is difficult, for several reasons. First of all, there are typically …