Principled reinforcement learning with human feedback from pairwise or k-wise comparisons
We provide a theoretical framework for Reinforcement Learning with Human Feedback
(RLHF). We show that when the underlying true reward is linear, under both Bradley-Terry …
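The Bradley-Terry model named in this abstract can be sketched in a few lines: the probability that one item is preferred over another is a logistic function of the difference of their rewards, and under the linear-reward assumption the parameters can be fit by gradient descent on the pairwise negative log-likelihood. This is a generic sketch of the model, not the paper's estimator; all function names are illustrative.

```python
import math

def bt_preference_prob(reward_a: float, reward_b: float) -> float:
    """Bradley-Terry: probability that item A is preferred over item B."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))

def linear_reward(theta, features):
    """Linear-reward assumption: r(x) = <theta, phi(x)>."""
    return sum(t * f for t, f in zip(theta, features))

def nll_gradient(theta, comparisons):
    """Gradient of the pairwise negative log-likelihood w.r.t. theta.
    Each comparison is (features_of_winner, features_of_loser)."""
    grad = [0.0] * len(theta)
    for phi_w, phi_l in comparisons:
        p_win = bt_preference_prob(linear_reward(theta, phi_w),
                                   linear_reward(theta, phi_l))
        for k in range(len(theta)):
            grad[k] += (p_win - 1.0) * (phi_w[k] - phi_l[k])
    return grad
```

A gradient-descent step `theta[k] -= lr * grad[k]` then moves the reward estimate toward the observed preferences.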
[BOOK] Bandit algorithms
T Lattimore, C Szepesvári - 2020 - books.google.com
Decision-making in the face of uncertainty is a significant challenge in machine learning,
and the multi-armed bandit model is a commonly used framework to address it. This …
Is RLHF more difficult than standard RL? A theoretical perspective
Reinforcement Learning from Human Feedback (RLHF) learns from preference
signals, while standard Reinforcement Learning (RL) directly learns from reward signals …
Introduction to multi-armed bandits
A Slivkins - Foundations and Trends® in Machine Learning, 2019 - nowpublishers.com
Multi-armed bandits are a simple but very powerful framework for algorithms that make
decisions over time under uncertainty. An enormous body of work has accumulated over the …
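As a concrete instance of the framework this entry describes, here is a minimal sketch of the textbook UCB1 algorithm (names and interface are illustrative, not from the book): pull the arm with the highest empirical mean plus an exploration bonus.

```python
import math

def ucb1(pull, n_arms, horizon):
    """UCB1: after one round-robin pass, pull the arm maximizing
    empirical mean + sqrt(2 ln t / n). `pull(arm)` returns a reward
    in [0, 1]; returns how often each arm was pulled."""
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1                        # play each arm once first
        else:
            arm = max(range(n_arms),
                      key=lambda a: sums[a] / counts[a]
                      + math.sqrt(2.0 * math.log(t) / counts[a]))
        counts[arm] += 1
        sums[arm] += pull(arm)
    return counts
```

On a toy instance where arm 0 always pays 1 and the rest pay 0, the suboptimal arms are pulled only O(log T) times.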
Towards conversational recommender systems
People often ask others for restaurant recommendations as a way to discover new dining
experiences. This makes restaurant recommendation an exciting scenario for recommender …
Dueling RL: reinforcement learning with trajectory preferences
We consider the problem of preference-based reinforcement learning (PbRL), where, unlike
traditional reinforcement learning (RL), an agent receives feedback only in terms of a 1-bit (0/1) …
Tsallis-INF: an optimal algorithm for stochastic and adversarial bandits
We derive an algorithm that achieves the optimal (within constants) pseudo-regret in both
adversarial and stochastic multi-armed bandits without prior knowledge of the regime and …
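A minimal sketch of the idea behind this entry, assuming the 1/2-Tsallis-entropy version: each round, the sampling distribution comes from a Newton solve for the mirror-descent normalizer, and losses are fed back via plain importance weighting (the paper uses a reduced-variance estimator; this, the learning-rate schedule, and all names here are simplifications):

```python
import math
import random

def tsallis_distribution(cum_losses, eta, iters=60):
    """Find x < min(L) so that p_i = 4 / (eta * (L_i - x))^2 sums to 1,
    via Newton's method on the normalization constraint."""
    x = min(cum_losses) - 2.0 / eta            # start left of the pole
    for _ in range(iters):
        p = [4.0 / (eta * (l - x)) ** 2 for l in cum_losses]
        f = sum(p) - 1.0
        fprime = sum(8.0 / (eta ** 2 * (l - x) ** 3) for l in cum_losses)
        x -= f / fprime
    return [4.0 / (eta * (l - x)) ** 2 for l in cum_losses]

def tsallis_inf(pull_loss, n_arms, horizon, seed=0):
    """Anytime loop: sample from the solved distribution and update
    cumulative losses with importance-weighted estimates.
    `pull_loss(arm)` returns a loss in [0, 1]."""
    rng = random.Random(seed)
    cum = [0.0] * n_arms
    counts = [0] * n_arms
    for t in range(1, horizon + 1):
        eta = 2.0 / math.sqrt(t)
        p = tsallis_distribution(cum, eta)
        arm = rng.choices(range(n_arms), weights=p)[0]
        cum[arm] += pull_loss(arm) / p[arm]    # importance weighting
        counts[arm] += 1
    return counts
```

With equal cumulative losses the solve returns the uniform distribution; an arm with larger cumulative loss gets proportionally less probability.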
Preference-based online learning with dueling bandits: A survey
In machine learning, the notion of multi-armed bandits refers to a class of online learning
problems, in which an agent is supposed to simultaneously explore and exploit a given set …
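For orientation on the dueling-bandit setting this survey covers, here is a deliberately naive baseline (uniform exploration plus an empirical Borda-style score), far weaker than the surveyed algorithms, with purely illustrative names:

```python
import random

def borda_duel(duel, n_arms, rounds, seed=0):
    """Sample uniformly random pairs, record duel outcomes, and return
    the arm with the best empirical win rate (a Borda-style score).
    `duel(i, j)` returns True iff arm i beats arm j."""
    rng = random.Random(seed)
    wins = [0] * n_arms
    plays = [0] * n_arms
    for _ in range(rounds):
        i, j = rng.sample(range(n_arms), 2)
        winner = i if duel(i, j) else j
        wins[winner] += 1
        plays[i] += 1
        plays[j] += 1
    scores = [w / p if p else 0.0 for w, p in zip(wins, plays)]
    return max(range(n_arms), key=scores.__getitem__)
```

The point of the comparison: unlike the UCB-style algorithms above, only relative (pairwise) feedback is observed, never a numeric reward.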