A survey of reinforcement learning from human feedback
Reinforcement learning from human feedback (RLHF) is a variant of reinforcement learning
(RL) that learns from human feedback instead of relying on an engineered reward function …
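For context, the objective such methods typically optimize is the KL-regularized reward maximization at the heart of most RLHF pipelines; the formulation below is the standard one from the literature rather than a quotation from this survey:

$$\max_{\pi_\theta}\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot\mid x)}\big[\, r_\phi(x, y) \,\big] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\big[\, \pi_\theta(\cdot\mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot\mid x) \,\big]$$

where $r_\phi$ is a reward model fit to human preference comparisons, $\pi_{\mathrm{ref}}$ is the supervised fine-tuned reference policy, and $\beta$ sets how far the optimized policy may drift from it.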
Reinforcement Learning: An Overview
K Murphy - arXiv preprint arXiv:2412.05265, 2024 - arxiv.org
This manuscript gives a big-picture, up-to-date overview of the field of (deep) reinforcement
learning and sequential decision making, covering value-based RL, policy-gradient …
Personalizing reinforcement learning from human feedback with variational preference learning
Reinforcement Learning from Human Feedback (RLHF) is a powerful paradigm for aligning
foundation models to human values and preferences. However, current RLHF techniques …
Beyond preferences in ai alignment
The dominant practice of AI alignment assumes (1) that preferences are an adequate
representation of human values, (2) that human rationality can be understood in terms of …
Self-consuming generative models with curated data provably optimize human preferences
The rapid progress in generative models has resulted in impressive leaps in generation
quality, blurring the lines between synthetic and real data. Web-scale datasets are now …
Improving context-aware preference modeling for language models
While finetuning language models from pairwise preferences has proven remarkably
effective, the underspecified nature of natural language presents critical challenges. Direct …
On extending direct preference optimization to accommodate ties
We derive and investigate two DPO variants that explicitly model the possibility of declaring
a tie in pair-wise comparisons. We replace the Bradley-Terry model in DPO with two well …
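For reference, standard DPO derives its loss from the Bradley-Terry comparison model; the snippet above is cut off before naming the paper's two tie-aware replacements, so the Rao-Kupper model shown here is only one common way to add an explicit tie probability, included as an illustration rather than as the paper's method:

$$P(y_w \succ y_l) = \frac{e^{r_w}}{e^{r_w} + e^{r_l}} \quad \text{(Bradley-Terry, no ties)}$$

$$P(y_w \succ y_l) = \frac{e^{r_w}}{e^{r_w} + \theta\, e^{r_l}}, \qquad P(y_w \sim y_l) = \frac{(\theta^2 - 1)\, e^{r_w + r_l}}{(e^{r_w} + \theta\, e^{r_l})(e^{r_l} + \theta\, e^{r_w})}, \qquad \theta \ge 1$$

where $r_w, r_l$ are the (implicit) rewards of the two responses and $\theta$ controls how readily ties are declared; $\theta = 1$ recovers Bradley-Terry.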
Elo Ratings in the Presence of Intransitivity
Adam H. Hamilton, Anna … - Electronic Journal of Statistics (ISSN 1935-7524)
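For context, the classical Elo system that this line of work revisits rates players by an expected score and a post-game update; the standard formulas (not the paper's intransitivity-aware extension) are:

$$E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}}, \qquad R_A' = R_A + K\,(S_A - E_A)$$

where $R_A, R_B$ are the current ratings, $S_A \in \{0, \tfrac{1}{2}, 1\}$ is the realized outcome for player A, and $K$ scales the update; intransitivity breaks the implicit assumption that a single scalar rating per player suffices.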