A survey of reinforcement learning from human feedback
Reinforcement learning from human feedback (RLHF) is a variant of reinforcement learning
(RL) that learns from human feedback instead of relying on an engineered reward function …
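Most of the work surveyed above replaces the engineered reward with a reward model fit to pairwise human preferences, typically under a Bradley-Terry model. A minimal sketch of the pairwise loss, assuming a toy scalar reward model (the function names and numbers below are illustrative, not from the survey):

    import math

    def bradley_terry_loss(r_preferred, r_rejected):
        # Negative log-likelihood that the preferred response wins under
        # the Bradley-Terry model: P(a beats b) = sigmoid(r_a - r_b).
        return -math.log(1.0 / (1.0 + math.exp(-(r_preferred - r_rejected))))

    def reward_model(weight, response_feature):
        # Toy stand-in for a learned reward model: one weight, one feature.
        return weight * response_feature

    # One human comparison: response A (feature 0.9) preferred over B (feature 0.2).
    loss = bradley_terry_loss(reward_model(1.5, 0.9), reward_model(1.5, 0.2))
    print(f"pairwise loss: {loss:.4f}")

Minimizing this loss over many comparisons is what turns raw preference data into the scalar reward that RL then optimizes.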
Making RL with preference-based feedback efficient via randomization
Reinforcement learning algorithms that learn from human feedback (RLHF) need to be
efficient in terms of statistical complexity, computational complexity, and query complexity. In …
Efficient algorithms for generalized linear bandits with heavy-tailed rewards
This paper investigates the problem of generalized linear bandits with heavy-tailed rewards,
whose $(1+\epsilon)$-th moment is bounded for some $\epsilon \in (0, 1]$. Although there …
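Spelled out, the heavy-tail condition quoted above says the reward noise has a bounded $(1+\epsilon)$-th moment; writing the noise as $\eta_t$ and the bound as $\nu$ (generic notation, not necessarily the paper's):

$$\mathbb{E}\bigl[\,|\eta_t|^{1+\epsilon}\,\bigr] \le \nu < \infty, \qquad \epsilon \in (0, 1].$$

At $\epsilon = 1$ this is a bounded-second-moment condition; for $\epsilon < 1$ even the variance may be infinite, which breaks the sub-Gaussian concentration arguments behind standard generalized linear bandit analyses.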
Provable benefits of policy learning from human preferences in contextual bandit problems
A crucial task in decision-making problems is reward engineering. It is common in practice
that no obvious choice of reward function exists. Thus, a popular approach is to introduce …
Variance-aware regret bounds for stochastic contextual dueling bandits
Dueling bandits is a prominent framework for decision-making involving preferential
feedback, a valuable feature that fits various applications involving human interaction, such …
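A single round of a dueling bandit makes the preferential feedback concrete: the learner proposes a pair of arms and observes only a noisy comparison, often modeled with a logistic link. A sketch under that assumed link (all names and toy features below are illustrative):

    import math
    import random

    def duel(theta, x_a, x_b):
        # Binary preference feedback: 1 if arm a beats arm b, drawn from a
        # Bernoulli whose mean is a logistic function of the score gap.
        p_a_wins = 1.0 / (1.0 + math.exp(-theta * (x_a - x_b)))
        return 1 if random.random() < p_a_wins else 0

    random.seed(0)
    arms = [0.1, 0.5, 0.9]          # toy scalar features
    a, b = 1, 2                     # the pair the learner chose to compare
    print(f"arm {a} beats arm {b}: {duel(theta=2.0, x_a=arms[a], x_b=arms[b])}")

Variance-aware bounds exploit that this Bernoulli feedback has variance $p(1-p)$: when comparisons are nearly deterministic, the effective noise is small and regret can shrink accordingly.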
Contextual bandits and imitation learning with preference-based active queries
We consider the problem of contextual bandits and imitation learning, where the learner
lacks direct knowledge of the executed action's reward. Instead, the learner can actively …
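The active-query step can be as simple as asking for a human comparison only when the current model cannot separate two candidates. A margin rule of that flavor (the threshold and names are assumptions for illustration, not the paper's algorithm):

    def should_query(score_a, score_b, margin=0.3):
        # Spend a (costly) human comparison only on close calls;
        # otherwise trust the model's own ranking.
        return abs(score_a - score_b) < margin

    print(should_query(0.62, 0.58))  # True: too close to call, query
    print(should_query(0.90, 0.10))  # False: model is already confident

Query complexity is then the number of rounds on which the rule fires.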
Reinforcement learning from human feedback with active queries
Aligning large language models (LLMs) with human preferences plays a key role in building
modern generative models and can be achieved by reinforcement learning from human …
Think Before You Duel: Understanding Complexities of Preference Learning under Constrained Resources
We consider the problem of reward maximization in the dueling bandit setup along with
constraints on resource consumption. As in the classic dueling bandits, at each round the …
Borda regret minimization for generalized linear dueling bandits
Dueling bandits are widely used to model preferential feedback prevalent in many
applications such as recommendation systems and ranking. In this paper, we study the …
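One common formalization of the Borda objective (the exact regret definition is paper-specific, so take this as generic notation): each arm's Borda score is its average probability of winning a duel against the other $K-1$ arms,

$$B(i) = \frac{1}{K-1}\sum_{j \neq i} \Pr(i \succ j), \qquad \mathrm{Reg}_T = \sum_{t=1}^{T}\bigl(2\,B(i^\ast) - B(a_t) - B(b_t)\bigr),$$

where $i^\ast = \arg\max_i B(i)$ and $(a_t, b_t)$ is the pair dueled at round $t$.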
Optimal design for reward modeling in RLHF
Reinforcement Learning from Human Feedback (RLHF) has become a popular approach to
align language models (LMs) with human preferences. This method involves collecting a …
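The "optimal design" in the title refers to classical experimental design: choosing which prompt-response comparisons to send to labelers so the reward model learns quickly. A greedy D-optimal heuristic over comparison feature differences conveys the idea (a sketch under assumed notation; reg, the toy data, and the names are all illustrative):

    import numpy as np

    def greedy_d_optimal(diffs, budget, reg=1e-3):
        # Greedily pick the comparison whose feature difference most
        # increases log det of the regularized design matrix.
        d = diffs.shape[1]
        A = reg * np.eye(d)
        chosen = []
        for _ in range(budget):
            gains = [np.linalg.slogdet(A + np.outer(z, z))[1] for z in diffs]
            best = int(np.argmax(gains))
            A += np.outer(diffs[best], diffs[best])
            chosen.append(best)
        return chosen

    rng = np.random.default_rng(0)
    diffs = rng.normal(size=(20, 4))  # candidate pairwise feature gaps (toy)
    print(greedy_d_optimal(diffs, budget=5))

Maximizing the design matrix's determinant spreads the labeled comparisons across informative directions instead of redundant ones.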