Exponential smoothing for off-policy learning
Off-policy learning (OPL) aims at finding improved policies from logged bandit data, often by
minimizing the inverse propensity scoring (IPS) estimator of the risk. In this work, we …
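For context, the IPS estimator referenced in this snippet has a standard, well-known form (generic notation; the paper's exact smoothed variant may differ):

$$\hat{R}_{\mathrm{IPS}}(\pi) \;=\; \frac{1}{n}\sum_{i=1}^{n} \frac{\pi(a_i \mid x_i)}{\pi_0(a_i \mid x_i)}\, c_i,$$

where $\pi_0$ is the logging policy, $(x_i, a_i, c_i)$ are the logged context, action, and cost, and $\pi$ is the policy being optimized. "Exponential smoothing" in this line of work typically tempers the logging propensity, e.g. replacing $\pi_0(a_i \mid x_i)$ with $\pi_0(a_i \mid x_i)^{\alpha}$ for some $\alpha \in (0, 1]$ to trade bias for variance; the precise scheme and its analysis are the paper's contribution.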
Mixed-effect Thompson sampling
A contextual bandit is a popular framework for online learning to act under uncertainty. In
practice, the number of actions is huge and their expected rewards are correlated. In this …
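Since the title builds on Thompson sampling, a minimal sketch of the standard independent-arm Bernoulli version is given below for reference. The `reward_fn` name and Beta(1, 1) priors are illustrative assumptions; the mixed-effect variant the paper describes would additionally share statistical strength across correlated actions.

```python
import numpy as np

def thompson_sampling(n_arms, horizon, reward_fn, rng=None):
    """Standard Bernoulli Thompson sampling with Beta(1, 1) priors.

    Baseline framework only; a mixed-effect variant would additionally
    share information across correlated arms.
    """
    rng = rng or np.random.default_rng()
    alpha = np.ones(n_arms)  # posterior successes + 1
    beta = np.ones(n_arms)   # posterior failures + 1
    for _ in range(horizon):
        theta = rng.beta(alpha, beta)   # one posterior sample per arm
        arm = int(np.argmax(theta))     # act greedily on the sample
        reward = reward_fn(arm)         # observe a Bernoulli reward in {0, 1}
        alpha[arm] += reward
        beta[arm] += 1 - reward
    return alpha, beta

# Example: five arms with unknown success probabilities.
# probs = [0.1, 0.3, 0.5, 0.7, 0.9]
# thompson_sampling(5, 1000, lambda a: np.random.binomial(1, probs[a]))
```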
Multi-task off-policy learning from bandit feedback
Many practical problems involve solving similar tasks. In recommender systems, the tasks
can be users with similar preferences; in search engines, the tasks can be items with similar …
Hierarchical conversational preference elicitation with bandit feedback
Recent advances in conversational recommendation provide a promising way to
efficiently elicit users' preferences via conversational interactions. To achieve this, the …
Goal-Conditioned Hierarchical Reinforcement Learning With High-Level Model Approximation
Hierarchical reinforcement learning (HRL) exhibits remarkable potential in addressing large-
scale and long-horizon complex tasks. However, a fundamental challenge, which arises …
Thompson sampling with diffusion generative prior
In this work, we introduce the idea of using denoising diffusion models to learn priors for online
decision-making problems. Our special focus is on meta-learning for the bandit framework …
Prior-Dependent Allocations for Bayesian Fixed-Budget Best-Arm Identification in Structured Bandits
We study the problem of Bayesian fixed-budget best-arm identification (BAI) in structured
bandits. We propose an algorithm that uses fixed allocations based on the prior information …
Linear diffusion models meet contextual bandits with large action spaces
Efficient exploration is a key challenge in contextual bandits due to the potentially large size
of their action space, where uninformed exploration can result in computational and …
Mixed-Effects Contextual Bandits
We study a novel variant of a contextual bandit problem with multi-dimensional reward
feedback formulated as a mixed-effects model, where the correlations between multiple …
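For reference, the classical linear mixed-effects model underlying this family has the form (generic notation; the paper's multi-dimensional reward formulation may differ):

$$y_{ij} = x_{ij}^\top \beta + z_{ij}^\top b_i + \epsilon_{ij}, \qquad b_i \sim \mathcal{N}(0, \Sigma_b), \qquad \epsilon_{ij} \sim \mathcal{N}(0, \sigma^2),$$

where the fixed effects $\beta$ are shared across all groups and the random effects $b_i$ induce the within-group correlations the abstract alludes to.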
Only pay for what is uncertain: Variance-adaptive Thompson sampling
Most bandit algorithms assume that the reward variances or their upper bounds are known,
and that they are the same for all arms. This naturally leads to suboptimal performance and …
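To illustrate what "variance-adaptive" can mean in practice, here is a minimal sketch of Gaussian Thompson sampling that estimates each arm's variance online and scales exploration accordingly. This is an assumption-laden illustration, not the paper's algorithm: the +1.0 pseudo-variance regularizer and the Welford update are illustrative choices.

```python
import numpy as np

def variance_adaptive_ts(n_arms, horizon, reward_fn, rng=None):
    """Gaussian Thompson sampling with per-arm variance estimates.

    A minimal sketch of the general idea that exploration should scale
    with each arm's unknown reward variance; NOT the paper's exact
    algorithm. The +1.0 pseudo-variance prior is an assumption.
    """
    rng = rng or np.random.default_rng()
    counts = np.zeros(n_arms)
    means = np.zeros(n_arms)
    m2 = np.zeros(n_arms)  # running sum of squared deviations (Welford)
    for t in range(horizon):
        if t < n_arms:  # pull each arm once to initialize statistics
            arm = t
        else:
            var_hat = (m2 + 1.0) / counts  # per-arm variance, regularized
            theta = rng.normal(means, np.sqrt(var_hat / counts))
            arm = int(np.argmax(theta))    # act on the sampled means
        r = reward_fn(arm)
        counts[arm] += 1
        delta = r - means[arm]
        means[arm] += delta / counts[arm]
        m2[arm] += delta * (r - means[arm])  # Welford online update
    return means, counts
```

Arms with larger estimated variance receive wider posterior samples and hence more exploration, which is the intuition behind "only paying for what is uncertain."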