Preference-based online learning with dueling bandits: A survey
In machine learning, the notion of multi-armed bandits refers to a class of online learning
problems, in which an agent is supposed to simultaneously explore and exploit a given set …
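Below is a minimal sketch of the pairwise-feedback protocol this survey covers: on each round the learner duels two arms and observes only which one won, drawn from an unknown preference matrix. The matrix P, the horizon, and the uniformly random pair selection are illustrative assumptions, not any algorithm from the survey.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative preference matrix: P[i, j] = probability that arm i beats arm j.
K = 4
P = np.array([[0.5, 0.6, 0.7, 0.8],
              [0.4, 0.5, 0.6, 0.7],
              [0.3, 0.4, 0.5, 0.6],
              [0.2, 0.3, 0.4, 0.5]])

wins = np.zeros((K, K))    # wins[i, j]: times arm i beat arm j
plays = np.zeros((K, K))   # plays[i, j]: times the pair (i, j) was duelled

for t in range(2000):
    # Naive strategy (illustrative only): duel a uniformly random pair.
    i, j = rng.choice(K, size=2, replace=False)
    i_wins = rng.random() < P[i, j]          # binary relative feedback only
    plays[i, j] += 1; plays[j, i] += 1
    if i_wins:
        wins[i, j] += 1
    else:
        wins[j, i] += 1

# Empirical pairwise win rates; the Condorcet winner (arm 0 here) beats every
# other arm with probability > 1/2, so its off-diagonal row should exceed 0.5.
with np.errstate(invalid="ignore"):
    p_hat = wins / plays
print(np.round(p_hat, 2))
```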
Efficient and optimal algorithms for contextual dueling bandits under realizability
We study the $K$-armed contextual dueling bandit problem, a sequential decision making
setting in which the learner uses contextual information to make two decisions, but only …
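A rough sketch of the contextual dueling setup under an assumed realizable model: preferences are taken to follow a linear Bradley-Terry model, i.e. the probability that arm a beats arm b in context x is sigmoid((phi(x, a) - phi(x, b)) @ theta*). The feature map features(x, a) is hypothetical, and the greedy pair choice with an online logistic update illustrates the interaction, not the paper's optimal algorithm.

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

K, d = 5, 3
theta_star = rng.normal(size=d)              # unknown true parameter (realizability assumption)
features = lambda x, a: x * (a + 1) / K      # hypothetical feature map phi(x, a)

theta = np.zeros(d)
lr = 0.05

for t in range(3000):
    x = rng.normal(size=d)                   # observed context
    phis = np.stack([features(x, a) for a in range(K)])
    scores = phis @ theta
    # Greedy pair choice (illustrative): duel the two highest-scoring arms.
    a, b = np.argsort(scores)[-2:]
    # Preference feedback from the (assumed) Bradley-Terry model.
    p_a_beats_b = sigmoid((phis[a] - phis[b]) @ theta_star)
    y = float(rng.random() < p_a_beats_b)
    # Online logistic-regression step on the feature difference.
    diff = phis[a] - phis[b]
    theta += lr * (y - sigmoid(diff @ theta)) * diff

print("estimated theta:", np.round(theta, 2))
print("true theta*:    ", np.round(theta_star, 2))
```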
Think Before You Duel: Understanding Complexities of Preference Learning under Constrained Resources
We consider the problem of reward maximization in the dueling bandit setup along with
constraints on resource consumption. As in the classic dueling bandits, at each round the …
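To make the constrained setting concrete, here is a hedged sketch in which every duel consumes resources against a hard budget. The latent-utility preference model, per-arm costs, budget value, and random pair selection are all assumptions for illustration; the paper's algorithm and constraint model are not reproduced.

```python
import numpy as np

rng = np.random.default_rng(2)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

K = 5
util = rng.normal(size=K)                 # latent utilities drive the duel outcomes
cost = rng.uniform(0.5, 2.0, size=K)      # resource consumed when an arm is played
budget = 500.0

wins = np.zeros(K)
plays = np.zeros(K)
spent, t = 0.0, 0

while True:
    i, j = rng.choice(K, size=2, replace=False)    # naive pair choice (illustrative)
    if spent + cost[i] + cost[j] > budget:
        break                                      # stop before violating the budget
    spent += cost[i] + cost[j]
    winner = i if rng.random() < sigmoid(util[i] - util[j]) else j
    wins[winner] += 1
    plays[i] += 1; plays[j] += 1
    t += 1

print(f"rounds played: {t}, resources spent: {spent:.1f} / {budget}")
print("empirical win fraction per arm:", np.round(wins / np.maximum(plays, 1), 2))
```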
Choice bandits
A Agarwal, N Johnson… - Advances in neural …, 2020 - proceedings.neurips.cc
There has been much interest in recent years in the problem of dueling bandits, where on
each round the learner plays a pair of arms and receives as feedback the outcome of a …
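The choice-bandit feedback can be sketched as follows: the learner plays a subset of arms (possibly larger than a pair) and observes only a single winner. The multinomial-logit winner model and the random-subset strategy below are illustrative assumptions, not the algorithm proposed in the paper.

```python
import numpy as np

rng = np.random.default_rng(3)

K, subset_size = 6, 3
util = rng.normal(size=K)

def winner(subset):
    # Winner drawn from a multinomial-logit choice model (an assumption).
    logits = util[subset]
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return subset[rng.choice(len(subset), p=probs)]

win_counts = np.zeros(K)
play_counts = np.zeros(K)

for t in range(5000):
    S = rng.choice(K, size=subset_size, replace=False)   # naive subset choice (illustrative)
    win_counts[winner(S)] += 1
    play_counts[S] += 1

rates = win_counts / np.maximum(play_counts, 1)
print("empirical win rate when shown:", np.round(rates, 2))
print("empirical best matches true best:", int(np.argmax(rates)) == int(np.argmax(util)))
```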
Nested elimination: a simple algorithm for best-item identification from choice-based feedback
We study the problem of best-item identification from choice-based feedback. In this
problem, a company sequentially and adaptively shows display sets to a population of …
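As a rough illustration of best-item identification from choice-based feedback, the sketch below runs a crude successive-elimination loop over display sets under an assumed multinomial-logit population. The elimination margin and batch schedule are arbitrary placeholders; this is not the Nested Elimination algorithm itself.

```python
import numpy as np

rng = np.random.default_rng(4)

n, display_size = 8, 4
util = rng.normal(size=n)                 # assumed latent item utilities

active = list(range(n))                   # items still in contention
chosen = np.zeros(n)
shown = np.zeros(n)

for batch in range(200):
    pool = np.array(active)
    S = rng.choice(pool, size=min(display_size, len(pool)), replace=False)
    # Choice feedback: one item from the display set, multinomial-logit (assumption).
    logits = util[S]
    probs = np.exp(logits - logits.max()); probs /= probs.sum()
    pick = S[rng.choice(len(S), p=probs)]
    chosen[pick] += 1
    shown[S] += 1
    if batch % 20 == 19 and len(active) > 1:
        rates = chosen[active] / np.maximum(shown[active], 1)
        leader = max(rates)
        # Crude elimination rule with a fixed margin (an assumption, not a proven threshold).
        active = [i for i in active if chosen[i] / max(shown[i], 1) > leader - 0.25]

print("surviving items:", active, "| true best item:", int(np.argmax(util)))
```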
Optimal and efficient dynamic regret algorithms for non-stationary dueling bandits
We study the problem of dynamic regret minimization in $K$-armed Dueling Bandits under
non-stationary or time-varying preferences. This is an online learning setup where the agent …
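One way to picture the non-stationary setting: latent utilities drift (and may switch abruptly), so the best arm changes over time and only recent comparisons remain informative. The sliding-window tracker below is a simple illustration of why dynamic regret is the right yardstick, not the paper's algorithm; the drift model and window length are assumptions.

```python
import numpy as np
from collections import deque

rng = np.random.default_rng(5)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

K, T, window = 4, 4000, 300
util = rng.normal(size=K)                 # latent utilities that will drift over time

history = deque(maxlen=window)            # recent (i, j, i_won) comparison outcomes

for t in range(T):
    util += 0.01 * rng.normal(size=K)     # slow drift (illustrative non-stationarity)
    if t == T // 2:
        util = util[::-1].copy()          # abrupt preference switch halfway through

    i, j = rng.choice(K, size=2, replace=False)   # naive pair choice (illustrative)
    i_won = rng.random() < sigmoid(util[i] - util[j])
    history.append((i, j, i_won))

# Sliding-window estimate of each arm's recent win rate.
wins = np.zeros(K); plays = np.zeros(K)
for i, j, i_won in history:
    plays[i] += 1; plays[j] += 1
    wins[i if i_won else j] += 1
print("recent win rates:", np.round(wins / np.maximum(plays, 1), 2))
print("current best arm (by latent utility):", int(np.argmax(util)))
```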
Exploiting correlation to achieve faster learning rates in low-rank preference bandits
We introduce the Correlated Preference Bandits problem with random utility-based
choice models (RUMs), where the goal is to identify the best item from a given pool of $n$ …
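A hedged sketch of a random-utility choice model with correlation induced by low-rank structure: item utilities come from rank-r embeddings times a shared preference vector, and the winner of a displayed subset is the item with the highest realized (Gumbel-perturbed) utility. The embedding dimension, subset size, and sampling scheme are illustrative assumptions, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(6)

n, r, subset_size = 10, 2, 4
item_emb = rng.normal(size=(n, r))        # low-rank item factors (correlation across items)
w = rng.normal(size=r)                    # shared latent preference direction
base_util = item_emb @ w

def rum_winner(subset):
    # Gumbel noise makes this RUM equivalent to a multinomial-logit choice.
    noise = rng.gumbel(size=len(subset))
    return subset[np.argmax(base_util[subset] + noise)]

wins = np.zeros(n); shown = np.zeros(n)
for t in range(8000):
    S = rng.choice(n, size=subset_size, replace=False)
    wins[rum_winner(S)] += 1
    shown[S] += 1

rates = wins / np.maximum(shown, 1)
print("top item by empirical choice rate:", int(np.argmax(rates)))
print("top item by true utility:         ", int(np.argmax(base_util)))
```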