Exponential smoothing for off-policy learning

I Aouali, VE Brunel, D Rohde… - … Conference on Machine …, 2023 - proceedings.mlr.press
Off-policy learning (OPL) aims at finding improved policies from logged bandit data, often by
minimizing the inverse propensity scoring (IPS) estimator of the risk. In this work, we …

Mixed-effect Thompson sampling

I Aouali, B Kveton, S Katariya - International Conference on …, 2023 - proceedings.mlr.press
A contextual bandit is a popular framework for online learning to act under uncertainty. In
practice, the number of actions is huge and their expected rewards are correlated. In this …

Multi-task off-policy learning from bandit feedback

J Hong, B Kveton, M Zaheer… - International …, 2023 - proceedings.mlr.press
Many practical problems involve solving similar tasks. In recommender systems, the tasks
can be users with similar preferences; in search engines, the tasks can be items with similar …

Hierarchical conversational preference elicitation with bandit feedback

J Zuo, S Hu, T Yu, S Li, H Zhao… - Proceedings of the 31st …, 2022 - dl.acm.org
The recent advances of conversational recommendations provide a promising way to
efficiently elicit users' preferences via conversational interactions. To achieve this, the …

Goal-Conditioned Hierarchical Reinforcement Learning With High-Level Model Approximation

Y Luo, T Ji, F Sun, H Liu, J Zhang… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
Hierarchical reinforcement learning (HRL) exhibits remarkable potential in addressing
large-scale and long-horizon complex tasks. However, a fundamental challenge, which arises …

Thompson sampling with diffusion generative prior

YG Hsieh, SP Kasiviswanathan, B Kveton… - arXiv preprint arXiv …, 2023 - arxiv.org
In this work, we initiate the idea of using denoising diffusion models to learn priors for online
decision making problems. Our special focus is on the meta-learning for bandit framework …

Prior-Dependent Allocations for Bayesian Fixed-Budget Best-Arm Identification in Structured Bandits

N Nguyen, I Aouali, A György, C Vernade - arXiv preprint arXiv …, 2024 - arxiv.org
We study the problem of Bayesian fixed-budget best-arm identification (BAI) in structured
bandits. We propose an algorithm that uses fixed allocations based on the prior information …

Linear diffusion models meet contextual bandits with large action spaces

I Aouali - 2023 - openreview.net
Efficient exploration is a key challenge in contextual bandits due to the potentially large size
of their action space, where uninformed exploration can result in computational and …

Mixed-Effects Contextual Bandits

K Lee, MC Paik, M Oh, GS Kim - … of the AAAI Conference on Artificial …, 2024 - ojs.aaai.org
We study a novel variant of a contextual bandit problem with multi-dimensional reward
feedback formulated as a mixed-effects model, where the correlations between multiple …

Only pay for what is uncertain: Variance-adaptive Thompson sampling

A Saha, B Kveton - arXiv preprint arXiv:2303.09033, 2023 - arxiv.org
Most bandit algorithms assume that the reward variances or their upper bounds are known,
and that they are the same for all arms. This naturally leads to suboptimal performance and …