Direct preference optimization: Your language model is secretly a reward model

R Rafailov, A Sharma, E Mitchell… - Advances in …, 2023 - proceedings.neurips.cc
While large-scale unsupervised language models (LMs) learn broad world knowledge and
some reasoning skills, achieving precise control of their behavior is difficult due to the …
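
For context, the DPO objective proposed in this paper reduces preference alignment to a single classification-style loss over preference pairs. The sketch below is a minimal, illustrative PyTorch version assuming precomputed sequence log-probabilities under the trainable policy and a frozen reference model; the function name, argument names, and the beta value are illustrative rather than taken from the paper.

    import torch
    import torch.nn.functional as F

    def dpo_loss(policy_chosen_logps: torch.Tensor,
                 policy_rejected_logps: torch.Tensor,
                 ref_chosen_logps: torch.Tensor,
                 ref_rejected_logps: torch.Tensor,
                 beta: float = 0.1) -> torch.Tensor:
        """DPO loss over a batch of preference pairs.

        Each argument is a 1-D tensor holding the summed token log-probability
        of the chosen / rejected completion under the trainable policy or the
        frozen reference model.
        """
        # Implicit rewards: scaled log-ratio between policy and reference model.
        chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
        rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
        # Bradley-Terry preference likelihood: push sigma(chosen - rejected) toward 1.
        return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()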

Using human feedback to fine-tune diffusion models without any reward model

K Yang, J Tao, J Lyu, C Ge, J Chen… - Proceedings of the …, 2024 - openaccess.thecvf.com
Using reinforcement learning with human feedback (RLHF) has shown significant promise in
fine-tuning diffusion models. Previous methods start by training a reward model that aligns …

A survey of reinforcement learning from human feedback

T Kaufmann, P Weng, V Bengs… - arXiv preprint arXiv …, 2023 - researchgate.net
Reinforcement learning from human feedback (RLHF) is a variant of reinforcement learning
(RL) that learns from human feedback instead of relying on an engineered reward function …
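
For reference, the RLHF fine-tuning stage surveyed here is commonly written as a KL-regularized objective against a learned reward model; the formulation below is the standard one rather than notation taken from this survey:

    \max_{\pi}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\big[r_\phi(x, y)\big]
      \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\big[\pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x)\big]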

Contextual bandits and imitation learning with preference-based active queries

A Sekhari, K Sridharan, W Sun… - Advances in Neural …, 2023 - proceedings.neurips.cc
We consider the problem of contextual bandits and imitation learning, where the learner
lacks direct knowledge of the executed action's reward. Instead, the learner can actively …

Making RL with preference-based feedback efficient via randomization

R Wu, W Sun - arXiv preprint arXiv:2310.14554, 2023 - arxiv.org
Reinforcement Learning algorithms that learn from human feedback (RLHF) need to be
efficient in terms of statistical complexity, computational complexity, and query complexity. In …

PARL: A unified framework for policy alignment in reinforcement learning from human feedback

S Chakraborty, AS Bedi, A Koppel, D Manocha… - arXiv preprint arXiv …, 2023 - arxiv.org
We present a novel unified bilevel optimization-based framework, PARL, formulated
to address the recently highlighted critical issue of policy alignment in reinforcement …

Multi-turn reinforcement learning from preference human feedback

L Shani, A Rosenberg, A Cassel, O Lang… - arXiv preprint arXiv …, 2024 - arxiv.org
Reinforcement Learning from Human Feedback (RLHF) has become the standard approach
for aligning Large Language Models (LLMs) with human preferences, allowing LLMs to …

RLVF: Learning from verbal feedback without overgeneralization

M Stephan, A Khazatsky, E Mitchell, AS Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
The diversity of contexts in which large language models (LLMs) are deployed requires the
ability to modify or customize default model behaviors to incorporate nuanced requirements …

Reward model learning vs. direct policy optimization: A comparative analysis of learning from human preferences

A Nika, D Mandal, P Kamalaruban, G Tzannetos… - arXiv preprint arXiv …, 2024 - arxiv.org
In this paper, we take a step towards a deeper understanding of learning from human
preferences by systematically comparing the paradigm of reinforcement learning from …

On championing foundation models: From explainability to interpretability

S Fu, Y Chen, Y Wang, D Tao - arXiv preprint arXiv:2410.11444, 2024 - arxiv.org
Understanding the inner mechanisms of black-box foundation models (FMs) is essential yet
challenging in artificial intelligence and its applications. Over the last decade, the long …