Self-play preference optimization for language model alignment
Traditional reinforcement learning from human feedback (RLHF) approaches relying on
parametric models like the Bradley-Terry model fall short in capturing the intransitivity and …
Variance-aware regret bounds for stochastic contextual dueling bandits
Dueling bandits is a prominent framework for decision-making involving preferential
feedback, a valuable feature that fits various applications involving human interaction, such …
Contextual bandits and imitation learning with preference-based active queries
We consider the problem of contextual bandits and imitation learning, where the learner
lacks direct knowledge of the executed action's reward. Instead, the learner can actively …
Reinforcement learning from human feedback with active queries
Aligning large language models (LLM) with human preference plays a key role in building
modern generative models and can be achieved by reinforcement learning from human …
Nearly optimal algorithms for contextual dueling bandits from adversarial feedback
Learning from human feedback plays an important role in aligning generative models, such
as large language models (LLM). However, the effectiveness of this approach can be …
Sharp Analysis for KL-Regularized Contextual Bandits and RLHF
Reverse-Kullback-Leibler (KL) regularization has emerged to be a predominant technique
used to enhance policy optimization in reinforcement learning (RL) and reinforcement …
Feel-Good Thompson Sampling for Contextual Dueling Bandits
Contextual dueling bandits, where a learner compares two options based on context and
receives feedback indicating which was preferred, extends classic dueling bandits by …
Constrained Dueling Bandits for Edge Intelligence
Bandit is acknowledged as a classical analytic tool for the online decision-making problem
under uncertainty, e.g., task assignment for crowdsourcing systems given the unknown …
Learning from Human Feedback: Ranking, Bandit, and Preference Optimization
Y Wu - 2024 - search.proquest.com
This dissertation investigates several challenges in artificial intelligence (AI) alignment and
reinforcement learning (RL), particularly focusing on applications when only preference …