A survey of reinforcement learning from human feedback

T Kaufmann, P Weng, V Bengs… - arXiv preprint arXiv…, 2023 - researchgate.net
Reinforcement learning from human feedback (RLHF) is a variant of reinforcement learning
(RL) that learns from human feedback instead of relying on an engineered reward function …

B-Pref: Benchmarking preference-based reinforcement learning

K Lee, L Smith, A Dragan, P Abbeel - arXiv preprint arXiv:2111.03026, 2021 - arxiv.org
Reinforcement learning (RL) requires access to a reward function that incentivizes the right
behavior, but such reward functions are notoriously hard to specify for complex tasks. Preference-based RL …

Benchmarks and algorithms for offline preference-based reward learning

D Shin, AD Dragan, DS Brown - arXiv preprint arXiv:2301.01392, 2023 - arxiv.org
Learning a reward function from human preferences is challenging, as it typically requires
having a high-fidelity simulator or using expensive and potentially unsafe actual physical …

Offline preference-based apprenticeship learning

D Shin, DS Brown, AD Dragan - arXiv preprint arXiv:2107.09251, 2021 - arxiv.org
Learning a reward function from human preferences is challenging, as it typically requires
having a high-fidelity simulator or using expensive and potentially unsafe actual physical …

CREW: Facilitating human-AI teaming research

L Zhang, Z Ji, B Chen - arXiv preprint arXiv:2408.00170, 2024 - arxiv.org
With the increasing deployment of artificial intelligence (AI) technologies, the potential for
humans to work alongside AI agents has grown rapidly. Human-AI teaming is an …

Interpretable reward learning via differentiable decision trees

A Kalra, DS Brown - NeurIPS ML Safety Workshop, 2022 - openreview.net
There is increasing interest in learning rewards and models of human intent from human
feedback. However, many approaches rely on black-box learning methods that, while expressive …

Can Differentiable Decision Trees Learn Interpretable Reward Functions?

A Kalra, DS Brown - 2023 - openreview.net
There is increasing interest in learning reward functions that model human intent and
human preferences. However, many frameworks use black-box learning methods that, while …

Can Differentiable Decision Trees Enable Interpretable Reward Learning from Human Feedback?

A Kalra, DS Brown - arXiv preprint arXiv:2306.13004, 2023 - arxiv.org
Reinforcement Learning from Human Feedback (RLHF) has emerged as a popular
paradigm for capturing human intent to alleviate the challenges of hand-crafting the reward …

Expert-in-the-loop for sequential decisions and predictions

K Brantley - 2021 - search.proquest.com
Sequential decisions and predictions are common problems in natural language
processing, robotics, and video games. Essentially, an agent interacts with an environment …

Counterfactual Explanations of Learned Reward Functions

J Wehner - repository.tudelft.nl
As AI systems become widely employed, this technology will profoundly impact society. To
ensure this impact is positive, it is essential to align these systems with the values and …