Self-play preference optimization for language model alignment

Y Wu, Z Sun, H Yuan, K Ji, Y Yang, Q Gu - arXiv preprint arXiv:2405.00675, 2024 - arxiv.org
Traditional reinforcement learning from human feedback (RLHF) approaches relying on
parametric models like the Bradley-Terry model fall short in capturing the intransitivity and …

Variance-aware regret bounds for stochastic contextual dueling bandits

Q Di, T Jin, Y Wu, H Zhao, F Farnoud, Q Gu - arXiv preprint arXiv …, 2023 - arxiv.org
Dueling bandits is a prominent framework for decision-making involving preferential
feedback, a valuable feature that fits various applications involving human interaction, such …

Contextual bandits and imitation learning with preference-based active queries

A Sekhari, K Sridharan, W Sun… - Advances in Neural …, 2024 - proceedings.neurips.cc
We consider the problem of contextual bandits and imitation learning, where the learner
lacks direct knowledge of the executed action's reward. Instead, the learner can actively …

Reinforcement learning from human feedback with active queries

K Ji, J He, Q Gu - arXiv preprint arXiv:2402.09401, 2024 - arxiv.org
Aligning large language models (LLMs) with human preference plays a key role in building
modern generative models and can be achieved by reinforcement learning from human …

Nearly optimal algorithms for contextual dueling bandits from adversarial feedback

Q Di, J He, Q Gu - arXiv preprint arXiv:2404.10776, 2024 - arxiv.org
Learning from human feedback plays an important role in aligning generative models, such
as large language models (LLMs). However, the effectiveness of this approach can be …

Sharp Analysis for KL-Regularized Contextual Bandits and RLHF

H Zhao, C Ye, Q Gu, T Zhang - arXiv preprint arXiv:2411.04625, 2024 - arxiv.org
Reverse-Kullback-Leibler (KL) regularization has emerged to be a predominant technique
used to enhance policy optimization in reinforcement learning (RL) and reinforcement …

Feel-Good Thompson Sampling for Contextual Dueling Bandits

X Li, H Zhao, Q Gu - arXiv preprint arXiv:2404.06013, 2024 - arxiv.org
Contextual dueling bandits, where a learner compares two options based on context and
receives feedback indicating which was preferred, extends classic dueling bandits by …

Constrained Dueling Bandits for Edge Intelligence

S Wang, Z Shao, Y Yang - IEEE Transactions on Network …, 2024 - ieeexplore.ieee.org
Bandit is acknowledged as a classical analytic tool for the online decision-making problem
under uncertainty, e.g., task assignment for crowdsourcing systems given the unknown …

Learning from Human Feedback: Ranking, Bandit, and Preference Optimization

Y Wu - 2024 - search.proquest.com
This dissertation investigates several challenges in artificial intelligence (AI) alignment and
reinforcement learning (RL), particularly focusing on applications when only preference …