A survey of reinforcement learning from human feedback

T Kaufmann, P Weng, V Bengs… - arXiv preprint arXiv …, 2023 - arxiv.org
Reinforcement learning from human feedback (RLHF) is a variant of reinforcement learning
(RL) that learns from human feedback instead of relying on an engineered reward function …
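
For concreteness, a minimal sketch of the mechanism the survey is about: fitting a reward model from pairwise human preferences under a Bradley-Terry assumption and using it in place of an engineered reward. The linear feature model, constants, and synthetic data below are illustrative assumptions, not anything taken from the survey itself.

    # Illustrative sketch (not the survey's algorithm): learn a linear reward
    # model r(x) = w . phi(x) from pairwise preferences, assuming a
    # Bradley-Terry link P(a preferred to b) = sigmoid(r(a) - r(b)),
    # by gradient ascent on the log-likelihood. Data here is synthetic.
    import numpy as np

    rng = np.random.default_rng(0)
    d, n_pairs = 5, 200
    w_true = rng.normal(size=d)                        # hidden "human" reward weights

    feats_a = rng.normal(size=(n_pairs, d))            # features of option a in each pair
    feats_b = rng.normal(size=(n_pairs, d))            # features of option b
    p_a = 1.0 / (1.0 + np.exp(-(feats_a - feats_b) @ w_true))
    prefs = (rng.random(n_pairs) < p_a).astype(float)  # 1 if a was preferred, else 0

    w = np.zeros(d)
    for _ in range(500):
        p = 1.0 / (1.0 + np.exp(-(feats_a - feats_b) @ w))       # model's P(a preferred)
        w += 0.1 * (feats_a - feats_b).T @ (prefs - p) / n_pairs  # ascend the log-likelihood

    print("cosine(w, w_true) =",
          w @ w_true / (np.linalg.norm(w) * np.linalg.norm(w_true)))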

Making RL with preference-based feedback efficient via randomization

R Wu, W Sun - arXiv preprint arXiv:2310.14554, 2023 - arxiv.org
Reinforcement Learning algorithms that learn from human feedback (RLHF) need to be
efficient in terms of statistical complexity, computational complexity, and query complexity. In …

Efficient algorithms for generalized linear bandits with heavy-tailed rewards

B Xue, Y Wang, Y Wan, J Yi… - Advances in Neural …, 2024 - proceedings.neurips.cc
This paper investigates the problem of generalized linear bandits with heavy-tailed rewards,
whose $(1+\epsilon)$-th moment is bounded for some $\epsilon \in (0, 1]$. Although there …
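
A standard device in the heavy-tailed bandit literature, shown here only to make the moment condition concrete (this is not necessarily the estimator proposed in the paper), is to truncate rewards at a level that grows with the sample count; the distribution and constants below are made up.

    # Truncated-mean sketch: when only the (1+eps)-th moment is bounded, the
    # plain empirical mean is fragile, while averaging rewards clipped at a
    # level growing like n**(1/(1+eps)) typically deviates far less in the
    # worst case over repeated trials.
    import numpy as np

    rng = np.random.default_rng(1)
    eps, n, trials = 0.5, 2000, 500
    a = 1.0 + eps + 0.05                     # Lomax shape: mean finite, variance infinite
    true_mean = 1.0 / (a - 1.0)
    threshold = n ** (1.0 / (1.0 + eps))     # truncation level from the moment assumption

    plain_err, trunc_err = [], []
    for _ in range(trials):
        x = rng.pareto(a, size=n)            # heavy-tailed rewards
        plain_err.append(abs(x.mean() - true_mean))
        trunc_err.append(abs(np.minimum(x, threshold).mean() - true_mean))

    print("worst-case error, plain mean    :", max(plain_err))
    print("worst-case error, truncated mean:", max(trunc_err))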

Provable benefits of policy learning from human preferences in contextual bandit problems

X Ji, H Wang, M Chen, T Zhao, M Wang - arXiv preprint arXiv:2307.12975, 2023 - arxiv.org
A crucial task in decision-making problems is reward engineering. It is common in practice
that no obvious choice of reward function exists. Thus, a popular approach is to introduce …

Variance-aware regret bounds for stochastic contextual dueling bandits

Q Di, T Jin, Y Wu, H Zhao, F Farnoud, Q Gu - arXiv preprint arXiv …, 2023 - arxiv.org
Dueling bandits is a prominent framework for decision-making involving preferential
feedback, a valuable feature that fits various applications involving human interaction, such …
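
To make the feedback model concrete, a generic dueling-bandit interaction loop (an illustrative epsilon-greedy sketch, not the variance-aware algorithm studied in this paper): the learner picks a pair of arms and observes only a single binary preference generated from hidden arm utilities.

    # Generic dueling-bandit round: choose a pair (i, j), observe a noisy
    # preference from a Bradley-Terry model over hidden utilities, never a
    # numeric reward. Exploration scheme and constants are made up.
    import numpy as np

    rng = np.random.default_rng(2)
    K, T = 5, 5000
    utility = rng.normal(size=K)                 # hidden arm qualities

    wins = np.ones((K, K))                       # smoothed pairwise win counts
    plays = 2 * np.ones((K, K))

    for t in range(T):
        if rng.random() < 0.1:                   # crude exploration
            i, j = rng.choice(K, size=2, replace=False)
        else:                                    # exploit: duel the two best empirical arms
            score = (wins / plays).mean(axis=1)  # rough "how often each arm wins"
            i, j = np.argsort(score)[-2:]
        p_i_beats_j = 1.0 / (1.0 + np.exp(-(utility[i] - utility[j])))
        if rng.random() < p_i_beats_j:           # observe only the preference
            wins[i, j] += 1
        else:
            wins[j, i] += 1
        plays[i, j] += 1
        plays[j, i] += 1

    print("best arm (true):          ", int(np.argmax(utility)))
    print("best arm (empirical wins):", int(np.argmax((wins / plays).mean(axis=1))))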

Contextual bandits and imitation learning with preference-based active queries

A Sekhari, K Sridharan, W Sun… - Advances in Neural …, 2024 - proceedings.neurips.cc
We consider the problem of contextual bandits and imitation learning, where the learner
lacks direct knowledge of the executed action's reward. Instead, the learner can actively …
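
A hedged sketch of the active-query idea (a generic uncertainty heuristic, not the algorithm analyzed in this paper): the learner asks for a pairwise preference only when its current model cannot confidently separate the two candidate actions, so queries are spent where they are informative. Features, threshold, and step size are illustrative assumptions.

    # Query a preference only when the estimated reward gap between the two
    # candidate actions is small; otherwise act without querying.
    import numpy as np

    rng = np.random.default_rng(3)
    d, T, margin = 4, 3000, 0.2
    w_true = rng.normal(size=d)                  # hidden preference model
    w_hat = np.zeros(d)                          # learner's running estimate
    queries = 0

    for t in range(T):
        phi = rng.normal(size=(2, d))            # features of two candidate actions
        gap = (phi[0] - phi[1]) @ w_hat
        if abs(gap) >= margin:                   # confident: skip the query
            continue
        queries += 1                             # uncertain: ask for a preference
        p0 = 1.0 / (1.0 + np.exp(-(phi[0] - phi[1]) @ w_true))
        y = 1.0 if rng.random() < p0 else 0.0    # 1 means action 0 was preferred
        p_hat = 1.0 / (1.0 + np.exp(-gap))
        w_hat += 0.5 * (y - p_hat) * (phi[0] - phi[1])   # one logistic-regression step

    print(f"queried on {queries}/{T} rounds")
    print("cosine(w_hat, w_true) =",
          w_hat @ w_true / (np.linalg.norm(w_hat) * np.linalg.norm(w_true) + 1e-12))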

Reinforcement learning from human feedback with active queries

K Ji, J He, Q Gu - arXiv preprint arXiv:2402.09401, 2024 - arxiv.org
Aligning large language models (LLMs) with human preferences plays a key role in building
modern generative models and can be achieved by reinforcement learning from human …

Think Before You Duel: Understanding Complexities of Preference Learning under Constrained Resources

R Deb, A Saha, A Banerjee - International Conference on …, 2024 - proceedings.mlr.press
We consider the problem of reward maximization in the dueling bandit setup along with
constraints on resource consumption. As in the classic dueling bandits, at each round the …

Borda regret minimization for generalized linear dueling bandits

Y Wu, T Jin, H Lou, F Farnoud, Q Gu - arXiv preprint arXiv:2303.08816, 2023 - arxiv.org
Dueling bandits are widely used to model preferential feedback prevalent in many
applications such as recommendation systems and ranking. In this paper, we study the …
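
The Borda score behind "Borda regret" is standard and worth writing out: arm i's Borda score is its probability of beating an opponent drawn uniformly at random from the other arms, and the Borda winner maximizes that score. The preference matrix below is a made-up example.

    import numpy as np

    # P[i, j] = probability that arm i is preferred to arm j (4 toy arms)
    P = np.array([
        [0.5, 0.6, 0.7, 0.4],
        [0.4, 0.5, 0.8, 0.6],
        [0.3, 0.2, 0.5, 0.5],
        [0.6, 0.4, 0.5, 0.5],
    ])
    K = P.shape[0]
    borda = (P.sum(axis=1) - 0.5) / (K - 1)   # average win probability vs. the other arms
    print("Borda scores:", np.round(borda, 3))
    print("Borda winner:", int(np.argmax(borda)))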

Optimal design for reward modeling in RLHF

A Scheid, E Boursier, A Durmus, MI Jordan… - arXiv preprint arXiv …, 2024 - arxiv.org
Reinforcement Learning from Human Feedback (RLHF) has become a popular approach to
align language models (LMs) with human preferences. This method involves collecting a …
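
To give the optimal-design flavor of the problem (a greedy D-optimal heuristic on made-up linear features, not the procedure proposed in this paper): among candidate comparison pairs, repeatedly query the one whose feature difference most increases the log-determinant of the design matrix, i.e. the pair most informative for estimating a linear reward model.

    import numpy as np

    rng = np.random.default_rng(4)
    d, n_candidates, budget = 6, 300, 20
    # feature difference phi(prompt, answer_a) - phi(prompt, answer_b) per candidate pair
    diffs = rng.normal(size=(n_candidates, d))

    V = 1e-3 * np.eye(d)                         # regularized design matrix
    chosen = []
    for _ in range(budget):
        V_inv = np.linalg.inv(V)
        # gain of candidate x: log det(V + x x^T) - log det(V) = log(1 + x^T V^{-1} x)
        gains = np.log1p(np.einsum("nd,de,ne->n", diffs, V_inv, diffs))
        best = int(np.argmax(gains))
        chosen.append(best)
        V += np.outer(diffs[best], diffs[best])

    print("comparison pairs a greedy D-optimal rule would query:", chosen)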