A minimaximalist approach to reinforcement learning from human feedback

G Swamy, C Dann, R Kidambi, ZS Wu… - arXiv preprint arXiv …, 2024 - arxiv.org
We present Self-Play Preference Optimization (SPO), an algorithm for reinforcement
learning from human feedback. Our approach is minimalist in that it does not require training …

Contrastive preference learning: learning from human feedback without RL

J Hejna, R Rafailov, H Sikchi, C Finn, S Niekum… - arXiv preprint arXiv …, 2023 - arxiv.org
Reinforcement Learning from Human Feedback (RLHF) has emerged as a popular
paradigm for aligning models with human intent. Typically RLHF algorithms operate in two …

Scaling laws for reward model overoptimization in direct alignment algorithms

R Rafailov, Y Chittepu, R Park, H Sikchi… - arXiv preprint arXiv …, 2024 - arxiv.org
Reinforcement Learning from Human Feedback (RLHF) has been crucial to the recent
success of Large Language Models (LLMs); however, it is often a complex and brittle …

Dual RL: Unification and new methods for reinforcement and imitation learning

H Sikchi, Q Zheng, A Zhang, S Niekum - arXiv preprint arXiv:2302.08560, 2023 - arxiv.org
The goal of reinforcement learning (RL) is to find a policy that maximizes the expected
cumulative return. It has been shown that this objective can be represented as an …

Robot air hockey: A manipulation testbed for robot learning with reinforcement learning

C Chuck, C Qi, MJ Munje, S Li, M Rudolph… - arXiv preprint arXiv …, 2024 - arxiv.org
Reinforcement Learning is a promising tool for learning complex policies even in fast-
moving and object-interactive domains where human teleoperation or hard-coded policies …

A dual representation framework for robot learning with human guidance

R Zhang, D Bansal, Y Hao, A Hiranaka… - … on Robot Learning, 2023 - proceedings.mlr.press
The ability to interactively learn skills from human guidance and adjust behavior according
to human preference is crucial to accelerating robot learning. But human guidance is an …

Trajectory improvement and reward learning from comparative language feedback

Z Yang, M Jun, J Tien, SJ Russell, A Dragan… - arXiv preprint arXiv …, 2024 - arxiv.org
Learning from human feedback has gained traction in fields like robotics and natural
language processing in recent years. While prior works mostly rely on human feedback in …

Sample-Efficient Preference-based Reinforcement Learning with Dynamics Aware Rewards

K Metcalf, M Sarabia, N Mackraz… - arXiv preprint arXiv …, 2024 - arxiv.org
Preference-based reinforcement learning (PbRL) aligns a robot behavior with human
preferences via a reward function learned from binary feedback over agent behaviors. We …

SMORE: Score Models for Offline Goal-Conditioned Reinforcement Learning

H Sikchi, R Chitnis, A Touati, A Geramifard… - arXiv preprint arXiv …, 2023 - arxiv.org
Offline Goal-Conditioned Reinforcement Learning (GCRL) is tasked with learning to achieve
multiple goals in an environment purely from offline datasets using sparse reward functions …

Imitation from arbitrary experience: A dual unification of reinforcement and imitation learning methods

H Sikchi, A Zhang, S Niekum - Workshop on Reincarnating …, 2023 - openreview.net
It is well known that Reinforcement Learning (RL) can be formulated as a convex program
with linear constraints. The dual form of this formulation is unconstrained, which we refer to …