Exponential smoothing for off-policy learning
Off-policy learning (OPL) aims at finding improved policies from logged bandit data, often by
minimizing the inverse propensity scoring (IPS) estimator of the risk. In this work, we …
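For context, the IPS estimator referenced in this snippet has a standard, well-known form (generic notation; the paper's exact smoothed variant may differ):

$$\hat{R}_{\mathrm{IPS}}(\pi) \;=\; \frac{1}{n}\sum_{i=1}^{n} \frac{\pi(a_i \mid x_i)}{\pi_0(a_i \mid x_i)}\, c_i,$$

where $\pi_0$ is the logging policy, $(x_i, a_i, c_i)$ are the logged context, action, and cost, and $\pi$ is the policy being optimized. "Exponential smoothing" in this line of work typically tempers the logging propensity, e.g. replacing $\pi_0(a_i \mid x_i)$ with $\pi_0(a_i \mid x_i)^{\alpha}$ for some $\alpha \in (0, 1]$ to trade bias for variance; the precise scheme and its analysis are the paper's contribution.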
Mixed-effect Thompson sampling
A contextual bandit is a popular framework for online learning to act under uncertainty. In
practice, the number of actions is huge and their expected rewards are correlated. In this …
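Since the title builds on Thompson sampling, a minimal sketch of the standard independent-arm Bernoulli version is given below for reference. The `reward_fn` name and Beta(1, 1) priors are illustrative assumptions; the mixed-effect variant the paper describes would additionally share statistical strength across correlated actions.

```python
import numpy as np

def thompson_sampling(n_arms, horizon, reward_fn, rng=None):
    """Standard Bernoulli Thompson sampling with Beta(1, 1) priors.

    Baseline framework only; a mixed-effect variant would additionally
    share information across correlated arms.
    """
    rng = rng or np.random.default_rng()
    alpha = np.ones(n_arms)  # posterior successes + 1
    beta = np.ones(n_arms)   # posterior failures + 1
    for _ in range(horizon):
        theta = rng.beta(alpha, beta)   # one posterior sample per arm
        arm = int(np.argmax(theta))     # act greedily on the sample
        reward = reward_fn(arm)         # observe a Bernoulli reward in {0, 1}
        alpha[arm] += reward
        beta[arm] += 1 - reward
    return alpha, beta

# Example: five arms with unknown success probabilities.
# probs = [0.1, 0.3, 0.5, 0.7, 0.9]
# thompson_sampling(5, 1000, lambda a: np.random.binomial(1, probs[a]))
```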
Multi-task off-policy learning from bandit feedback
Many practical problems involve solving similar tasks. In recommender systems, the tasks
can be users with similar preferences; in search engines, the tasks can be items with similar …
Hierarchical conversational preference elicitation with bandit feedback
Recent advances in conversational recommendation provide a promising way to
efficiently elicit users' preferences via conversational interactions. To achieve this, the …
Goal-Conditioned Hierarchical Reinforcement Learning With High-Level Model Approximation
Hierarchical reinforcement learning (HRL) exhibits remarkable potential in addressing large-
scale and long-horizon complex tasks. However, a fundamental challenge, which arises …
Thompson sampling with diffusion generative prior
In this work, we introduce the idea of using denoising diffusion models to learn priors for online
decision-making problems. Our special focus is on meta-learning for the bandit framework …
Prior-Dependent Allocations for Bayesian Fixed-Budget Best-Arm Identification in Structured Bandits
We study the problem of Bayesian fixed-budget best-arm identification (BAI) in structured
bandits. We propose an algorithm that uses fixed allocations based on the prior information …
Linear diffusion models meet contextual bandits with large action spaces
Efficient exploration is a key challenge in contextual bandits due to the potentially large size
of their action space, where uninformed exploration can result in computational and …
Mixed-Effects Contextual Bandits
We study a novel variant of a contextual bandit problem with multi-dimensional reward
feedback formulated as a mixed-effects model, where the correlations between multiple …
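For reference, the classical linear mixed-effects model underlying this family has the form (generic notation; the paper's multi-dimensional reward formulation may differ):

$$y_{ij} = x_{ij}^\top \beta + z_{ij}^\top b_i + \epsilon_{ij}, \qquad b_i \sim \mathcal{N}(0, \Sigma_b), \qquad \epsilon_{ij} \sim \mathcal{N}(0, \sigma^2),$$

where the fixed effects $\beta$ are shared across all groups and the random effects $b_i$ induce the within-group correlations the abstract alludes to.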
Only pay for what is uncertain: Variance-adaptive Thompson sampling
Most bandit algorithms assume that the reward variances or their upper bounds are known,
and that they are the same for all arms. This naturally leads to suboptimal performance and …
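To illustrate what "variance-adaptive" can mean in practice, here is a minimal sketch of Gaussian Thompson sampling that estimates each arm's variance online and scales exploration accordingly. This is an assumption-laden illustration, not the paper's algorithm: the +1.0 pseudo-variance regularizer and the Welford update are illustrative choices.

```python
import numpy as np

def variance_adaptive_ts(n_arms, horizon, reward_fn, rng=None):
    """Gaussian Thompson sampling with per-arm variance estimates.

    A minimal sketch of the general idea that exploration should scale
    with each arm's unknown reward variance; NOT the paper's exact
    algorithm. The +1.0 pseudo-variance prior is an assumption.
    """
    rng = rng or np.random.default_rng()
    counts = np.zeros(n_arms)
    means = np.zeros(n_arms)
    m2 = np.zeros(n_arms)  # running sum of squared deviations (Welford)
    for t in range(horizon):
        if t < n_arms:  # pull each arm once to initialize statistics
            arm = t
        else:
            var_hat = (m2 + 1.0) / counts  # per-arm variance, regularized
            theta = rng.normal(means, np.sqrt(var_hat / counts))
            arm = int(np.argmax(theta))    # act on the sampled means
        r = reward_fn(arm)
        counts[arm] += 1
        delta = r - means[arm]
        means[arm] += delta / counts[arm]
        m2[arm] += delta * (r - means[arm])  # Welford online update
    return means, counts
```

Arms with larger estimated variance receive wider posterior samples and hence more exploration, which is the intuition behind "only paying for what is uncertain."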