Diffusion model alignment using direct preference optimization
Large language models (LLMs) are fine-tuned using human comparison data with
Reinforcement Learning from Human Feedback (RLHF) methods to make them better …
Inverse preference learning: Preference-based RL without a reward function
Reward functions are difficult to design and often hard to align with human intent. Preference-
based Reinforcement Learning (RL) algorithms address these problems by learning reward …
CEIL: Generalized contextual imitation learning
In this paper, we present ContExtual Imitation Learning (CEIL), a general and broadly
applicable algorithm for imitation learning (IL). Inspired by the formulation of hindsight …
Discriminator-weighted offline imitation learning from suboptimal demonstrations
We study the problem of offline Imitation Learning (IL) where an agent aims to learn an
optimal expert behavior policy without additional online environment interactions. Instead …
Learning agile skills via adversarial imitation of rough partial demonstrations
Learning agile skills is one of the main challenges in robotics. To this end, reinforcement
learning approaches have achieved impressive results. These methods require explicit task …
Extreme Q-learning: MaxEnt RL without entropy
Modern Deep Reinforcement Learning (RL) algorithms require estimates of the maximal Q-
value, which are difficult to compute in continuous domains with an infinite number of …
Benchmarks and algorithms for offline preference-based reward learning
Learning a reward function from human preferences is challenging as it typically requires
having a high-fidelity simulator or using expensive and potentially unsafe actual physical …
SkillDiffuser: Interpretable hierarchical planning via skill abstractions in diffusion-based task execution
Diffusion models have demonstrated strong potential for robotic trajectory planning.
However, generating coherent trajectories from high-level instructions remains challenging …
Inverse reinforcement learning as the algorithmic basis for theory of mind: current methods and open problems
Theory of mind (ToM) is the psychological construct by which we model another's internal
mental states. Through ToM, we adjust our own behaviour to best suit a social context, and …
Maximum-likelihood inverse reinforcement learning with finite-time guarantees
Inverse reinforcement learning (IRL) aims to recover the reward function and the associated
optimal policy that best fits observed sequences of states and actions implemented by an …