A survey of reinforcement learning from human feedback

T Kaufmann, P Weng, V Bengs… - arXiv preprint arXiv …, 2023 - arxiv.org
Reinforcement learning from human feedback (RLHF) is a variant of reinforcement learning
(RL) that learns from human feedback instead of relying on an engineered reward function …
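
A standard building block in the RLHF pipelines such surveys cover is fitting a reward model to pairwise human comparisons with a Bradley-Terry likelihood. A minimal sketch of that loss; the function name and toy tensors are illustrative, not taken from the survey:

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of preferences under Bradley-Terry:
    P(chosen > rejected) = sigmoid(r_chosen - r_rejected)."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy usage: scalar rewards for a batch of 4 preference pairs.
r_chosen = torch.tensor([1.2, 0.3, 0.8, -0.1])
r_rejected = torch.tensor([0.4, 0.5, -0.2, -0.9])
loss = bradley_terry_loss(r_chosen, r_rejected)
```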

Reinforcement Learning: An Overview

K Murphy - arXiv preprint arXiv:2412.05265, 2024 - arxiv.org
This manuscript gives a big-picture, up-to-date overview of the field of (deep) reinforcement
learning and sequential decision making, covering value-based RL, policy-gradient …
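
As a concrete instance of the value-based family the overview covers, here is the standard tabular Q-learning update; the state/action counts and hyperparameters are arbitrary illustration:

```python
import numpy as np

n_states, n_actions = 16, 4
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.99  # learning rate, discount factor

def q_update(s: int, a: int, r: float, s_next: int) -> None:
    """Tabular Q-learning: move Q(s, a) toward the bootstrapped target
    r + gamma * max_a' Q(s', a')."""
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])

q_update(s=0, a=2, r=1.0, s_next=1)
```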

Personalizing reinforcement learning from human feedback with variational preference learning

S Poddar, Y Wan, H Ivison, A Gupta… - arXiv preprint arXiv …, 2024 - arxiv.org
Reinforcement Learning from Human Feedback (RLHF) is a powerful paradigm for aligning
foundation models to human values and preferences. However, current RLHF techniques …
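
The personalization idea can be sketched as a reward model conditioned on a per-user latent variable inferred from that user's annotations; the toy architecture below is an assumption for illustration, not the paper's model:

```python
import torch
import torch.nn as nn

class LatentConditionedReward(nn.Module):
    """Reward r(x, z): a response embedding scored jointly with a user latent z.
    Sketch only: the actual method infers the latent variationally from a
    user's labeled comparisons."""
    def __init__(self, d_resp: int = 32, d_latent: int = 8):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(d_resp + d_latent, 64), nn.ReLU(), nn.Linear(64, 1)
        )

    def forward(self, resp_emb: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        return self.score(torch.cat([resp_emb, z], dim=-1)).squeeze(-1)

model = LatentConditionedReward()
resp = torch.randn(4, 32)           # embeddings of 4 candidate responses
z = torch.randn(1, 8).expand(4, 8)  # one user's latent, shared across candidates
rewards = model(resp, z)            # different z => different per-user rankings
```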

Beyond preferences in ai alignment

T Zhi-Xuan, M Carroll, M Franklin, H Ashton - Philosophical Studies, 2024 - Springer
The dominant practice of AI alignment assumes (1) that preferences are an adequate
representation of human values, (2) that human rationality can be understood in terms of …

Self-consuming generative models with curated data provably optimize human preferences

D Ferbach, Q Bertrand, AJ Bose, G Gidel - arXiv preprint arXiv:2407.09499, 2024 - arxiv.org
The rapid progress in generative models has resulted in impressive leaps in generation
quality, blurring the lines between synthetic and real data. Web-scale datasets are now …
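
A toy version of the curated retraining loop the abstract alludes to: sample from the current model, keep the winner of a Bradley-Terry comparison, refit, repeat. The discrete model and reward values are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
reward = np.array([0.0, 1.0, 2.0])   # human preference over 3 possible outputs
p = np.ones(3) / 3                   # initial generative model (categorical)

for _ in range(50):
    # Sample pairs, then curate: keep the Bradley-Terry winner of each pair.
    a, b = rng.choice(3, size=(2, 2000), p=p)
    win_prob = 1 / (1 + np.exp(-(reward[a] - reward[b])))
    kept = np.where(rng.random(2000) < win_prob, a, b)
    # "Retrain": maximum-likelihood refit on the curated samples.
    p = np.bincount(kept, minlength=3) / kept.size

# p now concentrates on the highest-reward output, echoing the paper's claim
# that curated self-consuming loops implicitly optimize human preferences.
```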

Improving context-aware preference modeling for language models

S Pitis, Z Xiao, NL Roux, A Sordoni - arXiv preprint arXiv:2407.14916, 2024 - arxiv.org
While finetuning language models from pairwise preferences has proven remarkably
effective, the underspecified nature of natural language presents critical challenges. Direct …
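
One way to make an underspecified comparison well-posed is to condition the preference score on an explicit context; the scorer below is an assumed sketch, not the paper's model:

```python
import torch
import torch.nn as nn

class ContextAwareScorer(nn.Module):
    """Scores a response given a context embedding. Preference probability is
    sigmoid(score(ctx, a) - score(ctx, b)), so rankings can flip with context."""
    def __init__(self, d: int = 32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * d, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, ctx: torch.Tensor, resp: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([ctx, resp], dim=-1)).squeeze(-1)

scorer = ContextAwareScorer()
ctx, resp_a, resp_b = torch.randn(3, 32).unbind(0)
pref_a = torch.sigmoid(scorer(ctx, resp_a) - scorer(ctx, resp_b))
```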

On extending direct preference optimization to accommodate ties

J Chen, G Yang, W Lin, J Mei, B Byrne - arXiv preprint arXiv:2409.17431, 2024 - arxiv.org
We derive and investigate two DPO variants that explicitly model the possibility of declaring
a tie in pair-wise comparisons. We replace the Bradley-Terry model in DPO with two well …
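
A classical tie-aware replacement for Bradley-Terry is the Rao-Kupper model; whether this matches the paper's exact variants is an assumption. Taking delta as the usual DPO margin (beta times the difference of policy-vs-reference log-ratios for the two responses), a sketch of its three-outcome likelihood:

```python
import torch

def rao_kupper_logprobs(delta: torch.Tensor, tau: float):
    """Rao-Kupper extension of Bradley-Terry with a tie outcome.
    delta: preference margin; tau = ln(theta) >= 0 widens the tie band.
    Returns log P(win), log P(tie), log P(loss)."""
    p_win = torch.sigmoid(delta - tau)
    p_loss = torch.sigmoid(-delta - tau)
    p_tie = (1 - p_win - p_loss).clamp_min(1e-12)  # nonnegative when tau >= 0
    return p_win.log(), p_tie.log(), p_loss.log()

delta = torch.tensor([2.0, 0.1])
lw, lt, ll = rao_kupper_logprobs(delta, tau=0.5)
# Training would minimize -lw on clear preferences and -lt on annotated ties.
```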

Elo Ratings in the Presence of Intransitivity

AH Hamilton, M Roughan, A Kalenkova - arXiv preprint arXiv:2412.14427, 2024 - arxiv.org
Electronic Journal of Statistics, ISSN: 1935-7524 …
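
For reference, the standard Elo update the paper's analysis starts from; the K-factor and ratings below are illustrative. Because a single scalar rating determines every predicted win probability, Elo's predictions are necessarily transitive, which is exactly what intransitive data violates:

```python
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """One Elo step: expected score from the base-10 logistic model, then
    move both ratings toward the observed result.
    score_a is 1.0 for an A win, 0.5 for a draw, 0.0 for a loss."""
    expected_a = 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

r_a, r_b = elo_update(1500.0, 1600.0, score_a=1.0)  # upset win boosts A
```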