Principled reinforcement learning with human feedback from pairwise or k-wise comparisons

B Zhu, M Jordan, J Jiao - International Conference on …, 2023 - proceedings.mlr.press
We provide a theoretical framework for Reinforcement Learning with Human Feedback
(RLHF). We show that when the underlying true reward is linear, under both Bradley-Terry …

[BOOK][B] Bandit algorithms

T Lattimore, C Szepesvári - 2020 - books.google.com
Decision-making in the face of uncertainty is a significant challenge in machine learning,
and the multi-armed bandit model is a commonly used framework to address it. This …

Is RLHF more difficult than standard RL? A theoretical perspective

Y Wang, Q Liu, C Jin - Advances in Neural Information …, 2023 - proceedings.neurips.cc
Reinforcement Learning from Human Feedback (RLHF) learns from preference
signals, while standard Reinforcement Learning (RL) directly learns from reward signals …

Introduction to multi-armed bandits

A Slivkins - Foundations and Trends® in Machine Learning, 2019 - nowpublishers.com
Multi-armed bandits are a simple but very powerful framework for algorithms that make
decisions over time under uncertainty. An enormous body of work has accumulated over the …

Towards conversational recommender systems

K Christakopoulou, F Radlinski… - Proceedings of the 22nd …, 2016 - dl.acm.org
People often ask others for restaurant recommendations as a way to discover new dining
experiences. This makes restaurant recommendation an exciting scenario for recommender …

Dueling RL: Reinforcement learning with trajectory preferences

A Saha, A Pacchiano, J Lee - International Conference on …, 2023 - proceedings.mlr.press
We consider the problem of preference-based reinforcement learning (PbRL), where, unlike
traditional reinforcement learning (RL), an agent receives feedback only in terms of 1 bit …

Dueling RL: reinforcement learning with trajectory preferences

A Pacchiano, A Saha, J Lee - arXiv preprint arXiv:2111.04850, 2021 - arxiv.org
We consider the problem of preference based reinforcement learning (PbRL), where, unlike
traditional reinforcement learning, an agent receives feedback only in terms of a 1 bit (0/1) …

An optimal algorithm for stochastic and adversarial bandits

J Zimmert, Y Seldin - The 22nd International Conference on …, 2019 - proceedings.mlr.press
We derive an algorithm that achieves the optimal (up to constants) pseudo-regret in both
adversarial and stochastic multi-armed bandits without prior knowledge of the regime and …

Tsallis-INF: An optimal algorithm for stochastic and adversarial bandits

J Zimmert, Y Seldin - Journal of Machine Learning Research, 2021 - jmlr.org
We derive an algorithm that achieves the optimal (within constants) pseudo-regret in both
adversarial and stochastic multi-armed bandits without prior knowledge of the regime and …

Preference-based online learning with dueling bandits: A survey

V Bengs, R Busa-Fekete, A El Mesaoudi-Paul… - Journal of Machine …, 2021 - jmlr.org
In machine learning, the notion of multi-armed bandits refers to a class of online learning
problems, in which an agent is supposed to simultaneously explore and exploit a given set …