Efficient and robust sequential decision making algorithms

P Xu - AI Magazine, 2024 - Wiley Online Library
Sequential decision‐making involves making informed decisions based on continuous
interactions with a complex environment. This process is ubiquitous in various applications …

Kullback-Leibler Maillard sampling for multi-armed bandits with bounded rewards

H Qin, KS Jun, C Zhang - Advances in Neural Information Processing Systems, 2024 - proceedings.neurips.cc
We study $K$-armed bandit problems where the reward distributions of the arms are all
supported on the $[0, 1]$ interval. Maillard sampling (Maillard, 2013), an …
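
The snippet suggests an arm-selection rule in the Maillard family, with the usual squared-gap exponent replaced by a KL divergence. Below is a minimal sketch under that assumption, using the Bernoulli KL as the divergence for [0, 1]-supported rewards; the function names and the toy loop are illustrative, not the paper's exact algorithm:

```python
import numpy as np

def bernoulli_kl(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q), elementwise."""
    p = np.clip(p, eps, 1 - eps)
    q = np.clip(q, eps, 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def kl_ms_probs(means, counts):
    """Maillard-style weights: arm a is drawn with probability proportional
    to exp(-N_a * kl(mu_a, mu_max)); the empirically best arm always gets
    weight 1, since its KL term is zero."""
    w = np.exp(-counts * bernoulli_kl(means, np.max(means)))
    return w / w.sum()

# toy run on three arms with rewards supported on [0, 1]
rng = np.random.default_rng(0)
true_means = np.array([0.3, 0.5, 0.6])
counts = np.ones(3)                              # one forced pull per arm
means = rng.binomial(1, true_means).astype(float)
for _ in range(10_000):
    arm = rng.choice(3, p=kl_ms_probs(means, counts))
    r = rng.binomial(1, true_means[arm])
    means[arm] += (r - means[arm]) / (counts[arm] + 1)
    counts[arm] += 1
print(counts)                                    # pulls concentrate on arm 2
```

Because the empirically best arm always has weight one, exploration of the other arms decays as their pull counts and estimated KL gaps grow.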

A general recipe for the analysis of randomized multi-armed bandit algorithms

D Baudry, K Suzuki, J Honda - arXiv preprint arXiv:2303.06058, 2023 - arxiv.org
In this paper we propose a general methodology for deriving regret bounds for randomized
multi-armed bandit algorithms. It consists of checking a set of sufficient conditions on the …
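
The sufficient conditions themselves are elided in the snippet, but regret bounds of this kind conventionally start from the standard decomposition over suboptimal arms, so such conditions only need to control the expected pull counts:

```latex
% Standard regret decomposition for K-armed bandits (background; the
% paper's specific sufficient conditions are elided in the snippet).
\[
  \mathrm{Reg}(T)
  = T\mu^{*} - \mathbb{E}\Bigl[\sum_{t=1}^{T} \mu_{A_t}\Bigr]
  = \sum_{a:\,\Delta_a>0} \Delta_a\, \mathbb{E}\bigl[N_a(T)\bigr],
  \qquad \Delta_a := \mu^{*} - \mu_a,
\]
% where N_a(T) counts pulls of arm a up to round T, so any bound on
% E[N_a(T)] for each suboptimal arm yields a regret bound.
```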

Monte-Carlo tree search with uncertainty propagation via optimal transport

T Dam, P Stenger, L Schneider, J Pajarinen… - arXiv preprint arXiv …, 2023 - arxiv.org
This paper introduces a novel backup strategy for Monte-Carlo Tree Search (MCTS)
designed for highly stochastic and partially observable Markov decision processes. We …
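
For contrast with the distributional backup the paper proposes, the standard MCTS backup propagates a scalar return up the visited path as an incremental mean; a minimal sketch (the Node fields and the backup signature are illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    visits: int = 0
    value: float = 0.0                      # running mean of backed-up returns
    children: dict = field(default_factory=dict)

def backup(path, ret):
    """Classic Monte-Carlo backup: push the simulated return up the visited
    path as an incremental mean. Distributional backups (e.g. via optimal
    transport, as in the paper above) replace this scalar update with an
    update on value distributions, so uncertainty propagates up the tree too."""
    for node in reversed(path):
        node.visits += 1
        node.value += (ret - node.value) / node.visits
```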

Randomized Exploration in Cooperative Multi-Agent Reinforcement Learning

HL Hsu, W Wang, M Pajic, P Xu - arXiv preprint arXiv:2404.10728, 2024 - arxiv.org
We present the first study on provably efficient randomized exploration in cooperative multi-
agent reinforcement learning (MARL). We propose a unified algorithm framework for …

Thompson Sampling for Non-Stationary Bandit Problems

H Qi, F Guo, L Zhu - Entropy, 2025 - mdpi.com
Non-stationary multi-armed bandit (MAB) problems have recently attracted extensive
attention. We focus on the abruptly changing scenario where reward distributions remain …
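
The snippet does not reveal the paper's mechanism, but a common baseline for abruptly changing Bernoulli bandits is discounted Thompson sampling, which decays the Beta posteriors so the sampler can re-explore after a change point. A hedged sketch (gamma, the floor, and the toy change schedule are illustrative choices, not the paper's algorithm):

```python
import numpy as np

def discounted_ts(mu_schedule, gamma=0.95, floor=0.1, seed=0):
    """Discounted Thompson sampling for abruptly changing Bernoulli bandits:
    Beta posteriors are decayed by gamma each round so old evidence fades
    and the sampler re-explores after a change point. The floor keeps the
    posteriors proper."""
    rng = np.random.default_rng(seed)
    n_arms = mu_schedule.shape[1]
    alpha = np.ones(n_arms)
    beta = np.ones(n_arms)
    total = 0
    for mu in mu_schedule:                       # row t = true means at round t
        arm = int(np.argmax(rng.beta(alpha, beta)))
        r = rng.binomial(1, mu[arm])
        alpha = np.maximum(alpha * gamma, floor)  # forget old evidence
        beta = np.maximum(beta * gamma, floor)
        alpha[arm] += r
        beta[arm] += 1 - r
        total += r
    return total

# toy run: the best arm switches abruptly halfway through
mus = np.vstack([np.tile([0.8, 0.2], (500, 1)),
                 np.tile([0.2, 0.8], (500, 1))])
print(discounted_ts(mus))
```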

Zero-Inflated Bandits

H Wei, R Wan, L Shi, R Song - arXiv preprint arXiv:2312.15595, 2023 - arxiv.org
Many real applications of bandits have sparse non-zero rewards, leading to slow learning
rates. Careful distribution modeling that exploits problem-specific structure is known as …
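
The title suggests factoring the reward into a Bernoulli "is the reward nonzero?" indicator times a nonzero magnitude. Below is a sketch of Thompson sampling under that zero-inflated model, assuming nonnegative rewards, a Beta posterior on the nonzero probability, and a unit-variance Gaussian posterior on the magnitude mean; all of these modeling choices are assumptions for illustration, not the paper's estimator:

```python
import numpy as np

class ZeroInflatedTS:
    """Thompson sampling under a zero-inflated reward model:
    reward = B * M, with B ~ Bernoulli(p_a) indicating a nonzero reward
    and M its magnitude. Beta posterior on p_a; Gaussian posterior with
    known unit variance on the magnitude mean. Assumes rewards >= 0."""

    def __init__(self, n_arms, seed=0):
        self.rng = np.random.default_rng(seed)
        self.alpha = np.ones(n_arms)   # Beta posterior on P(reward != 0)
        self.beta = np.ones(n_arms)
        self.n = np.zeros(n_arms)      # number of nonzero observations
        self.s = np.zeros(n_arms)      # sum of nonzero rewards

    def select(self):
        p = self.rng.beta(self.alpha, self.beta)
        # N(0, 1) prior on the magnitude mean -> posterior N(s/(n+1), 1/(n+1))
        m = self.rng.normal(self.s / (self.n + 1), 1.0 / np.sqrt(self.n + 1))
        return int(np.argmax(p * m))   # posterior sample of E[reward]

    def update(self, arm, reward):
        nonzero = float(reward != 0)
        self.alpha[arm] += nonzero
        self.beta[arm] += 1.0 - nonzero
        if nonzero:
            self.n[arm] += 1
            self.s[arm] += reward
```

The point of the factorization is that the frequent zero observations only update the Beta posterior, so the magnitude estimate is not dragged down by reward sparsity.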