A tutorial on Thompson sampling

DJ Russo, B Van Roy, A Kazerouni… - … and Trends® in …, 2018 - nowpublishers.com
Thompson sampling is an algorithm for online decision problems where actions are taken
sequentially in a manner that must balance between exploiting what is known to maximize …
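
As a quick illustration of the exploit/explore balance the tutorial describes, here is a minimal sketch of the Beta-Bernoulli special case of Thompson sampling; the arm means and horizon below are made up for the demo.

```python
import numpy as np

def thompson_sampling_bernoulli(true_means, horizon, seed=0):
    """Beta-Bernoulli Thompson sampling (sketch; inputs are illustrative)."""
    rng = np.random.default_rng(seed)
    k = len(true_means)
    alpha, beta = np.ones(k), np.ones(k)     # Beta(1, 1) prior per arm
    total = 0
    for _ in range(horizon):
        theta = rng.beta(alpha, beta)        # sample a mean for each arm
        arm = int(np.argmax(theta))          # play the arm whose sample is best
        r = rng.binomial(1, true_means[arm]) # observe a Bernoulli reward
        alpha[arm] += r                      # posterior update: success count
        beta[arm] += 1 - r                   # posterior update: failure count
        total += r
    return total

print(thompson_sampling_bernoulli([0.3, 0.5, 0.7], horizon=1000))
```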

[BOOK] Bandit algorithms

T Lattimore, C Szepesvári - 2020 - books.google.com
Decision-making in the face of uncertainty is a significant challenge in machine learning,
and the multi-armed bandit model is a commonly used framework to address it. This …
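
For context on the framework the book covers, a minimal sketch of the bandit interaction loop with UCB1, one of the classical index policies analyzed there; the arm means are illustrative.

```python
import numpy as np

def ucb1(true_means, horizon, seed=0):
    """UCB1 (sketch): play each arm once, then maximize mean + exploration bonus."""
    rng = np.random.default_rng(seed)
    k = len(true_means)
    counts, sums = np.zeros(k), np.zeros(k)
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1                                   # initialization round
        else:
            bonus = np.sqrt(2 * np.log(t) / counts)       # confidence width
            arm = int(np.argmax(sums / counts + bonus))
        r = rng.binomial(1, true_means[arm])
        counts[arm] += 1
        sums[arm] += r
    return sums.sum()

print(ucb1([0.3, 0.5, 0.7], horizon=1000))
```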

Neural Thompson sampling

W Zhang, D Zhou, L Li, Q Gu - arXiv preprint arXiv:2010.00827, 2020 - arxiv.org
Thompson Sampling (TS) is one of the most effective algorithms for solving contextual multi-
armed bandit problems. In this paper, we propose a new algorithm, called Neural Thompson …
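
A simplified sketch of the NeuralTS-style selection step: the sampled score's variance is built from the model's parameter gradient, which in the linear demo below is just the context itself. The names f, grad_f, U, and nu are placeholder bookkeeping, not the paper's exact construction.

```python
import numpy as np

def neural_ts_step(f, grad_f, contexts, U, nu, rng):
    """Pick an arm by sampling each score from N(f(x), nu^2 * g U^{-1} g)."""
    scores = []
    for x in contexts:
        g = grad_f(x)                                   # gradient "features"
        var = nu**2 * g @ np.linalg.solve(U, g)         # posterior-variance proxy
        scores.append(rng.normal(f(x), np.sqrt(max(var, 0.0))))
    return int(np.argmax(scores))

# Demo with a linear "network" f(x) = w @ x, whose gradient features are x.
rng = np.random.default_rng(0)
w = np.array([0.2, -0.1, 0.4])
f, grad_f = lambda x: w @ x, lambda x: x
U = np.eye(3)
contexts = [rng.normal(size=3) for _ in range(5)]
print(neural_ts_step(f, grad_f, contexts, U, nu=0.5, rng=rng))
```

After playing an arm with gradient g, one would update U with the rank-one term g g^T and retrain the network on the new observation.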

On information gain and regret bounds in Gaussian process bandits

S Vakili, K Khezeli, V Picheny - International Conference on …, 2021 - proceedings.mlr.press
Consider the sequential optimization of an expensive-to-evaluate and possibly non-convex
objective function $f$ from noisy feedback, which can be considered as a continuum-armed …
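
A rough sketch of the loop such papers analyze, here GP-UCB on a 1-D grid: the kernel lengthscale, noise level, and beta schedule are illustrative, and this paper's contribution concerns tighter bounds on the information gain term that calibrates such confidence widths.

```python
import numpy as np

def rbf(a, b, ls=0.3):
    """Squared-exponential kernel on 1-D inputs."""
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ls) ** 2)

def gp_posterior(X, y, Xstar, noise=0.1):
    """Standard GP regression posterior mean and variance on Xstar."""
    K = rbf(X, X) + noise**2 * np.eye(len(X))
    Ks = rbf(X, Xstar)
    sol = np.linalg.solve(K, Ks)
    mu = sol.T @ y
    var = np.diag(rbf(Xstar, Xstar)) - np.einsum('ij,ij->j', Ks, sol)
    return mu, np.maximum(var, 1e-12)

rng = np.random.default_rng(0)
grid = np.linspace(0, 1, 200)
f = lambda x: np.sin(6 * x)                     # the unknown objective (simulated)
X, y = np.array([0.5]), np.array([f(0.5)])
for t in range(2, 30):
    mu, var = gp_posterior(X, y, grid)
    beta = 2.0 * np.log(len(grid) * t**2)       # width tied to information gain
    x = grid[np.argmax(mu + np.sqrt(beta * var))]
    X, y = np.append(X, x), np.append(y, f(x) + 0.1 * rng.normal())
print(X[-1])                                    # settles near the maximizer
```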

Frequentist regret bounds for randomized least-squares value iteration

A Zanette, D Brandfonbrener… - International …, 2020 - proceedings.mlr.press
We consider the exploration-exploitation dilemma in finite-horizon reinforcement learning
(RL). When the state space is large or continuous, traditional tabular approaches are …
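
A tabular sketch of the randomized least-squares value iteration idea: each stage's Q-values are fit to Gaussian-perturbed Bellman targets, so exploration comes from the regression noise rather than explicit bonuses. The counts/rsum/nxt bookkeeping and the sigma, lam values are assumptions for the demo, not the paper's notation.

```python
import numpy as np

def rlsvi_episode(counts, rsum, nxt, H, nS, nA, sigma=1.0, lam=1.0, rng=None):
    """One planning pass: ridge-regress noise-perturbed targets, stage by stage."""
    if rng is None:
        rng = np.random.default_rng()
    Q = np.zeros((H + 1, nS, nA))                    # Q[H] is the zero boundary
    for h in range(H - 1, -1, -1):
        vnext = Q[h + 1].max(axis=1)                 # greedy value at stage h+1
        for s in range(nS):
            for a in range(nA):
                n = counts[h, s, a]
                target = rsum[h, s, a] + nxt[h, s, a] @ vnext
                noise = sigma * np.sqrt(n) * rng.normal()   # summed per-sample noise
                Q[h, s, a] = (target + noise) / (n + lam)   # tabular ridge solution
    return Q

rng = np.random.default_rng(0)
H, nS, nA = 3, 2, 2
nxt = rng.integers(0, 4, size=(H, nS, nA, nS)).astype(float)  # transition counts
counts = nxt.sum(axis=-1)                                     # visits per (h, s, a)
rsum = counts * rng.random((H, nS, nA))                       # summed rewards
print(rlsvi_episode(counts, rsum, nxt, H, nS, nA, rng=rng)[0])
```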

Efficient exploration through Bayesian deep Q-networks

K Azizzadenesheli, E Brunskill… - 2018 Information …, 2018 - ieeexplore.ieee.org
We propose Bayesian Deep Q-Network (BDQN), a practical Thompson sampling-based
reinforcement learning (RL) algorithm. Thompson sampling allows for targeted exploration …
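
A sketch of the BDQN ingredient: Bayesian linear regression over the Q-network's last-layer features, with actions chosen greedily under one posterior sample of the head weights. Phi, y, and the variance parameters are placeholders, not the paper's exact setup.

```python
import numpy as np

def bdqn_sample_head(Phi, y, sigma2=1.0, prior_var=1.0, rng=None):
    """Sample head weights from the Bayesian linear-regression posterior."""
    if rng is None:
        rng = np.random.default_rng()
    d = Phi.shape[1]
    A = Phi.T @ Phi / sigma2 + np.eye(d) / prior_var   # posterior precision
    mean = np.linalg.solve(A, Phi.T @ y / sigma2)      # posterior mean
    return rng.multivariate_normal(mean, np.linalg.inv(A))

rng = np.random.default_rng(0)
Phi = rng.normal(size=(50, 4))                  # last-layer features of (s, a) pairs
y = Phi @ np.array([1.0, -0.5, 0.2, 0.0]) + 0.1 * rng.normal(size=50)  # TD targets
w = bdqn_sample_head(Phi, y, rng=rng)
print(w)   # act greedily w.r.t. phi(s, a) @ w until the next posterior resample
```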

Learning to optimize under non-stationarity

WC Cheung, D Simchi-Levi… - The 22nd International …, 2019 - proceedings.mlr.press
We introduce algorithms that achieve state-of-the-art dynamic regret bounds for the non-
stationary linear stochastic bandit setting, which captures natural applications such as dynamic …
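
A sketch of the forgetting mechanism such algorithms build on: a sliding-window ridge estimate that discards data older than w rounds so the estimate can track a drifting parameter. The window size and regularizer are illustrative; the paper pairs this kind of estimator with confidence bonuses.

```python
import numpy as np

def sliding_window_ridge(X, y, w, lam=1.0):
    """Ridge estimate over only the last w rounds (sketch)."""
    Xw, yw = X[-w:], y[-w:]
    d = X.shape[1]
    return np.linalg.solve(Xw.T @ Xw + lam * np.eye(d), Xw.T @ yw)

rng = np.random.default_rng(0)
T, d = 500, 3
X = rng.normal(size=(T, d))
theta = np.where(np.arange(T)[:, None] < 250, [1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
y = np.einsum('td,td->t', X, theta) + 0.1 * rng.normal(size=T)  # abrupt change
print(sliding_window_ridge(X, y, w=100))   # tracks the post-change parameter
```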

Bayesian decision-making under misspecified priors with applications to meta-learning

M Simchowitz, C Tosh… - Advances in …, 2021 - proceedings.neurips.cc
Thompson sampling and other Bayesian sequential decision-making algorithms are among
the most popular approaches to tackle explore/exploit trade-offs in (contextual) bandits. The …
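
A small simulation of the failure mode the paper studies: Bernoulli Thompson sampling run once with a flat prior and once with a confidently wrong one. All numbers are illustrative.

```python
import numpy as np

def ts_regret(true_means, prior_a, prior_b, horizon=2000, seed=0):
    """Bernoulli TS under a (possibly misspecified) Beta prior; returns regret."""
    rng = np.random.default_rng(seed)
    a, b = np.array(prior_a, float), np.array(prior_b, float)
    best, regret = max(true_means), 0.0
    for _ in range(horizon):
        arm = int(np.argmax(rng.beta(a, b)))
        r = rng.binomial(1, true_means[arm])
        a[arm] += r
        b[arm] += 1 - r
        regret += best - true_means[arm]
    return regret

means = [0.4, 0.6]
print(ts_regret(means, [1, 1], [1, 1]))     # flat prior: low regret
print(ts_regret(means, [20, 1], [1, 20]))   # prior confident in the worse arm
```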

Online (multinomial) logistic bandit: Improved regret and constant computation cost

YJ Zhang, M Sugiyama - Advances in Neural Information …, 2024 - proceedings.neurips.cc
This paper investigates the logistic bandit problem, a variant of the generalized linear bandit
model that uses a logistic function to model the feedback from an action. While most existing …
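
A sketch of a logistic bandit round with constant per-round cost: optimistic scores under the logistic link, then a single online gradient step on the observed binary feedback, so the update does not grow with the history. The step size, bonus, and simulated feedback model are placeholders, not the paper's algorithm.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_bandit_round(theta, contexts, rng, eta=0.5, bonus=0.3):
    """One round: optimistic arm choice, then an O(d) logistic update."""
    scores = [sigmoid(theta @ x) + bonus * np.linalg.norm(x) for x in contexts]
    arm = int(np.argmax(scores))
    x = contexts[arm]
    r = rng.binomial(1, sigmoid(x[0] - 0.5 * x[1]))      # simulated feedback
    theta = theta + eta * (r - sigmoid(theta @ x)) * x   # one gradient step
    return theta, arm, r

rng = np.random.default_rng(0)
theta = np.zeros(2)
for _ in range(200):
    contexts = [rng.normal(size=2) for _ in range(5)]
    theta, arm, r = logistic_bandit_round(theta, contexts, rng)
print(theta)   # drifts toward the simulated parameter (1.0, -0.5)
```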

Meta dynamic pricing: Transfer learning across experiments

H Bastani, D Simchi-Levi, R Zhu - Management Science, 2022 - pubsonline.informs.org
We study the problem of learning shared structure across a sequence of dynamic pricing
experiments for related products. We consider a practical formulation in which the unknown …
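
A sketch of the transfer idea: pool the posterior estimates from earlier pricing experiments into an empirical Gaussian meta-prior for the next product's Thompson-sampling run. The two-parameter demand model and all numbers are made up for illustration.

```python
import numpy as np

def fit_meta_prior(posterior_means):
    """Empirical Gaussian prior from per-experiment parameter estimates."""
    means = np.asarray(posterior_means, float)
    return means.mean(axis=0), np.cov(means, rowvar=False)

# Demand parameters (intercept, price slope) learned in past experiments.
past = [[1.9, -0.8], [2.1, -1.0], [2.0, -0.9], [2.2, -1.1]]
mu0, Sigma0 = fit_meta_prior(past)
print(mu0)     # prior mean for the next product's pricing experiment
print(Sigma0)  # prior covariance; wider means more exploration early on
```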