A minimaximalist approach to reinforcement learning from human feedback

G Swamy, C Dann, R Kidambi, ZS Wu… - arXiv preprint arXiv …, 2024 - arxiv.org
We present Self-Play Preference Optimization (SPO), an algorithm for reinforcement
learning from human feedback. Our approach is minimalist in that it does not require training …
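
As a rough illustration of the reward-model-free idea (a hedged sketch based on my reading of the abstract, not the authors' reference implementation): each completion can be scored by its preference win rate against other completions sampled from the same policy, and that win rate used as the reward for an ordinary RL update. The policy.generate and preference callables below are hypothetical placeholders.

    # Hedged sketch of a self-play win-rate reward (assumed reading of the
    # abstract; policy.generate() and preference() are hypothetical placeholders).
    def self_play_rewards(prompt, policy, preference, n_samples=8):
        """Score each sampled completion by its win rate against the others."""
        completions = [policy.generate(prompt) for _ in range(n_samples)]
        rewards = []
        for i, y in enumerate(completions):
            # preference(prompt, a, b) -> 1.0 if a is preferred over b, else 0.0
            wins = [preference(prompt, y, completions[j])
                    for j in range(n_samples) if j != i]
            rewards.append(sum(wins) / len(wins))
        return completions, rewards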

BOND: Aligning LLMs with best-of-N distillation

PG Sessa, R Dadashi, L Hussenot, J Ferret… - arXiv preprint arXiv …, 2024 - arxiv.org
Reinforcement learning from human feedback (RLHF) is a key driver of quality and safety in
state-of-the-art large language models. Yet, a surprisingly simple and strong inference-time …
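
The inference-time strategy alluded to here is best-of-N sampling, which BOND aims to distill into the policy itself. A minimal sketch of plain best-of-N follows; generate and reward_model are hypothetical placeholder callables, not an API from the paper.

    # Minimal best-of-N sampling sketch: draw N candidates, return the one the
    # reward model scores highest. generate() and reward_model() are placeholders.
    def best_of_n(prompt, generate, reward_model, n=16):
        candidates = [generate(prompt) for _ in range(n)]
        return max(candidates, key=lambda y: reward_model(prompt, y))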

RLHF workflow: From reward modeling to online RLHF

H Dong, W Xiong, B Pang, H Wang, H Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
We present the workflow of Online Iterative Reinforcement Learning from Human Feedback
(RLHF) in this technical report, which is widely reported to outperform its offline counterpart …
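
A hedged sketch of what an online iterative loop of this kind typically looks like (my paraphrase of the general recipe, not the report's exact procedure): repeatedly sample responses from the current policy, rank them with a reward model to form fresh preference pairs, and run a direct-alignment update. All callables below are hypothetical placeholders.

    # Hedged sketch of an online iterative preference-learning loop (assumed
    # structure; generate(), reward_model() and dpo_update() are placeholders).
    def online_iterative_rlhf(policy, reward_model, prompts, n_iters=3, k=8):
        for _ in range(n_iters):
            pairs = []
            for x in prompts:
                candidates = sorted((policy.generate(x) for _ in range(k)),
                                    key=lambda y: reward_model(x, y))
                # best vs. worst candidate forms one on-policy preference pair
                pairs.append((x, candidates[-1], candidates[0]))
            policy = policy.dpo_update(pairs)  # placeholder alignment step
        return policy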

Sharp analysis for KL-regularized contextual bandits and RLHF

H Zhao, C Ye, Q Gu, T Zhang - arXiv preprint arXiv:2411.04625, 2024 - arxiv.org
Reverse Kullback-Leibler (KL) regularization has emerged as a predominant technique
for enhancing policy optimization in reinforcement learning (RL) and reinforcement …
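
For reference, the reverse-KL-regularized objective this line of work studies, together with its well-known Gibbs-style maximizer (standard background with β the regularization strength and π_ref the reference policy; not a result specific to this paper):

    % KL-regularized objective and its closed-form solution (up to normalization).
    \max_{\pi}\;\mathbb{E}_{x\sim\rho}\Big[\mathbb{E}_{y\sim\pi(\cdot\mid x)}\big[r(x,y)\big]
      \;-\;\beta\,\mathrm{KL}\big(\pi(\cdot\mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big)\Big],
    \qquad
    \pi^{\star}(y\mid x)\;\propto\;\pi_{\mathrm{ref}}(y\mid x)\,\exp\!\big(r(x,y)/\beta\big).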

Accelerating Goal-Conditioned RL Algorithms and Research

M Bortkiewicz, W Pałucki, V Myers, T Dziarmaga… - arXiv preprint arXiv …, 2024 - arxiv.org
Self-supervision has the potential to transform reinforcement learning (RL), paralleling the
breakthroughs it has enabled in other areas of machine learning. While self-supervised …

Jackpot! Alignment as a Maximal Lottery

RR Maura-Rivero, M Lanctot, F Visin… - arXiv preprint arXiv …, 2025 - arxiv.org
Reinforcement Learning from Human Feedback (RLHF), the standard for aligning Large
Language Models (LLMs) with human values, is known to fail to satisfy properties that are …
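
For background on the social-choice object in the title (the standard definition from probabilistic social choice, not a claim about the paper's construction): with pairwise preference margins M_{ij}, a maximal lottery is a distribution p over alternatives that beats or ties every other lottery in expectation, i.e. an optimal strategy of the symmetric zero-sum game with payoff matrix M.

    % Maximal lottery: p is maximal iff it weakly beats every lottery q under M.
    p^{\top} M q \;\ge\; 0 \quad \text{for all lotteries } q,
    \qquad M_{ij} \;=\; \Pr(i \succ j) - \Pr(j \succ i).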

Reward-Augmented Data Enhances Direct Preference Alignment of LLMs

S Zhang, Z Liu, B Liu, Y Zhang, Y Yang, Y Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
Preference alignment in Large Language Models (LLMs) has significantly improved their
ability to adhere to human instructions and intentions. However, existing direct alignment …
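
For context, the standard DPO objective that direct-alignment methods start from (background only; the paper's reward-augmented data construction is not shown here). The log-probability arguments are assumed to be tensors of per-response sums of token log-likelihoods.

    # Standard DPO loss for background (not the paper's reward-augmented variant).
    import torch.nn.functional as F

    def dpo_loss(policy_chosen_logps, policy_rejected_logps,
                 ref_chosen_logps, ref_rejected_logps, beta=0.1):
        # Implicit rewards are beta-scaled log-ratios against the reference policy.
        chosen = beta * (policy_chosen_logps - ref_chosen_logps)
        rejected = beta * (policy_rejected_logps - ref_rejected_logps)
        # Logistic loss on the margin between preferred and dispreferred responses.
        return -F.logsigmoid(chosen - rejected).mean()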

Learning from Human Feedback: Ranking, Bandit, and Preference Optimization

Y Wu - 2024 - search.proquest.com
This dissertation investigates several challenges in artificial intelligence (AI) alignment and
reinforcement learning (RL), particularly focusing on applications when only preference …
