- Academic Search

Principled reinforcement learning with human feedback from pairwise or k-wise comparisons

B Zhu, M Jordan, J Jiao - International Conference on …, 2023 - proceedings.mlr.press

We provide a theoretical framework for Reinforcement Learning with Human Feedback
(RLHF). We show that when the underlying true reward is linear, under both Bradley-Terry …

Cited by 177 - Related articles - All 8 versions - HTML version

[PDF] tor-lattimore.com

[BOOK][B] Bandit algorithms

T Lattimore, C Szepesvári - 2020 - books.google.com

Decision-making in the face of uncertainty is a significant challenge in machine learning,
and the multi-armed bandit model is a commonly used framework to address it. This …

Cited by 3281 - Related articles - All 9 versions - Library search

[PDF] neurips.cc

Is RLHF more difficult than standard RL? A theoretical perspective

Y Wang, Q Liu, C Jin - Advances in Neural Information …, 2023 - proceedings.neurips.cc

Reinforcement Learning from Human Feedback (RLHF) learns from preference
signals, while standard Reinforcement Learning (RL) directly learns from reward signals …

Cited by 26 - Related articles - All 4 versions - HTML version

[PDF] nowpublishers.com

Introduction to multi-armed bandits

A Slivkins - Foundations and Trends® in Machine Learning, 2019 - nowpublishers.com

Multi-armed bandits are a simple but very powerful framework for algorithms that make
decisions over time under uncertainty. An enormous body of work has accumulated over the …

Cited by 1252 - Related articles - All 7 versions - Library search - HTML version

[PDF] microsoft.com

Towards conversational recommender systems

K Christakopoulou, F Radlinski… - Proceedings of the 22nd …, 2016 - dl.acm.org

People often ask others for restaurant recommendations as a way to discover new dining
experiences. This makes restaurant recommendation an exciting scenario for recommender …

Cited by 514 - Related articles - All 7 versions

[PDF] mlr.press

Dueling RL: Reinforcement learning with trajectory preferences

A Saha, A Pacchiano, J Lee - International Conference on …, 2023 - proceedings.mlr.press

We consider the problem of preference-based reinforcement learning (PbRL), where, unlike
traditional reinforcement learning (RL), an agent receives feedback only in terms of 1 bit …

Cited by 34 - Related articles - HTML version

[PDF] arxiv.org

Dueling RL: Reinforcement learning with trajectory preferences

A Pacchiano, A Saha, J Lee - arXiv preprint arXiv:2111.04850, 2021 - arxiv.org

We consider the problem of preference based reinforcement learning (PbRL), where, unlike
traditional reinforcement learning, an agent receives feedback only in terms of a 1 bit (0/1) …

Cited by 54 - Related articles - All 2 versions - HTML version

[PDF] mlr.press

An optimal algorithm for stochastic and adversarial bandits

J Zimmert, Y Seldin - The 22nd International Conference on …, 2019 - proceedings.mlr.press

We derive an algorithm that achieves the optimal (up to constants) pseudo-regret in both
adversarial and stochastic multi-armed bandits without prior knowledge of the regime and …

Cited by 129 - Related articles - All 4 versions - HTML version

[PDF] jmlr.org

Tsallis-INF: An optimal algorithm for stochastic and adversarial bandits

J Zimmert, Y Seldin - Journal of Machine Learning Research, 2021 - jmlr.org

We derive an algorithm that achieves the optimal (within constants) pseudo-regret in both
adversarial and stochastic multi-armed bandits without prior knowledge of the regime and …

Cited by 131 - Related articles - All 6 versions - HTML version

[PDF] jmlr.org

Preference-based online learning with dueling bandits: A survey

V Bengs, R Busa-Fekete, A El Mesaoudi-Paul… - Journal of Machine …, 2021 - jmlr.org

In machine learning, the notion of multi-armed bandits refers to a class of online learning
problems, in which an agent is supposed to simultaneously explore and exploit a given set …

Cited by 121 - Related articles - All 7 versions - HTML version

Reducing dueling bandits to cardinal bandits

Principled reinforcement learning with human feedback from pairwise or k-wise comparisons
