Diffusion model alignment using direct preference optimization

B Wallace, M Dang, R Rafailov… - Proceedings of the …, 2024 - openaccess.thecvf.com
Large language models (LLMs) are fine-tuned using human comparison data with
Reinforcement Learning from Human Feedback (RLHF) methods to make them better …

Inverse preference learning: Preference-based RL without a reward function

J Hejna, D Sadigh - Advances in Neural Information …, 2024 - proceedings.neurips.cc
Reward functions are difficult to design and often hard to align with human intent. Preference-
based Reinforcement Learning (RL) algorithms address these problems by learning reward …

CEIL: Generalized contextual imitation learning

J Liu, L He, Y Kang, Z Zhuang… - Advances in Neural …, 2023 - proceedings.neurips.cc
In this paper, we present ContExtual Imitation Learning (CEIL), a general and broadly
applicable algorithm for imitation learning (IL). Inspired by the formulation of hindsight …

Discriminator-weighted offline imitation learning from suboptimal demonstrations

H Xu, X Zhan, H Yin, H Qin - International Conference on …, 2022 - proceedings.mlr.press
We study the problem of offline Imitation Learning (IL) where an agent aims to learn an
optimal expert behavior policy without additional online environment interactions. Instead …

Learning agile skills via adversarial imitation of rough partial demonstrations

C Li, M Vlastelica, S Blaes, J Frey… - … on Robot Learning, 2023 - proceedings.mlr.press
Learning agile skills is one of the main challenges in robotics. To this end, reinforcement
learning approaches have achieved impressive results. These methods require explicit task …

Extreme Q-learning: MaxEnt RL without entropy

D Garg, J Hejna, M Geist, S Ermon - arXiv preprint arXiv:2301.02328, 2023 - arxiv.org
Modern Deep Reinforcement Learning (RL) algorithms require estimates of the maximal Q-
value, which are difficult to compute in continuous domains with an infinite number of …

Benchmarks and algorithms for offline preference-based reward learning

D Shin, AD Dragan, DS Brown - arXiv preprint arXiv:2301.01392, 2023 - arxiv.org
Learning a reward function from human preferences is challenging as it typically requires
having a high-fidelity simulator or using expensive and potentially unsafe actual physical …

SkillDiffuser: Interpretable hierarchical planning via skill abstractions in diffusion-based task execution

Z Liang, Y Mu, H Ma, M Tomizuka… - Proceedings of the …, 2024 - openaccess.thecvf.com
Diffusion models have demonstrated strong potential for robotic trajectory planning.
However, generating coherent trajectories from high-level instructions remains challenging …

Inverse reinforcement learning as the algorithmic basis for theory of mind: current methods and open problems

J Ruiz-Serra, MS Harré - Algorithms, 2023 - mdpi.com
Theory of mind (ToM) is the psychological construct by which we model another's internal
mental states. Through ToM, we adjust our own behaviour to best suit a social context, and …

Maximum-likelihood inverse reinforcement learning with finite-time guarantees

S Zeng, C Li, A Garcia, M Hong - Advances in Neural …, 2022 - proceedings.neurips.cc
Inverse reinforcement learning (IRL) aims to recover the reward function and the associated
optimal policy that best fits observed sequences of states and actions implemented by an …