A tutorial on Thompson sampling
Thompson sampling is an algorithm for online decision problems where actions are taken
sequentially in a manner that must balance between exploiting what is known to maximize …
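The mechanism the tutorial builds on fits in a few lines. Below is a minimal Beta-Bernoulli Thompson sampling sketch; the arm means, horizon, and seed are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-armed Bernoulli bandit; the true means are made up.
true_means = np.array([0.3, 0.5, 0.7])
n_arms = len(true_means)

# Beta(1, 1) priors: alpha tracks successes, beta tracks failures.
alpha = np.ones(n_arms)
beta = np.ones(n_arms)

for t in range(1000):
    # Draw one plausible mean per arm from its posterior, then act
    # greedily on the draw.
    theta = rng.beta(alpha, beta)
    arm = int(np.argmax(theta))
    reward = rng.random() < true_means[arm]
    # Conjugate Beta-Bernoulli posterior update.
    alpha[arm] += reward
    beta[arm] += 1 - reward

print("posterior means:", alpha / (alpha + beta))
```

Sampling from the posterior and then maximizing the sample is what produces the balance the abstract refers to: arms with uncertain means occasionally generate large draws and get tried, while arms known to be good are played most of the time.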
[BOOK] Bandit algorithms
T. Lattimore, C. Szepesvári, 2020
Decision-making in the face of uncertainty is a significant challenge in machine learning,
and the multi-armed bandit model is a commonly used framework to address it. This …
Neural Thompson sampling
Thompson Sampling (TS) is one of the most effective algorithms for solving contextual multi-
armed bandit problems. In this paper, we propose a new algorithm, called Neural Thompson …
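A rough sketch of an action-selection round in this spirit is below, using PyTorch: the sampled reward for each arm is drawn from a Gaussian whose mean is the network output and whose variance is built from the network gradient and a design matrix U. The width-dependent scaling in the paper's exact rule is dropped here, and lam, nu, and the toy sizes are all assumptions.

```python
import torch

torch.manual_seed(0)

d, lam, nu = 4, 1.0, 0.5
net = torch.nn.Sequential(
    torch.nn.Linear(d, 16), torch.nn.Tanh(), torch.nn.Linear(16, 1))
n_params = sum(q.numel() for q in net.parameters())
U = lam * torch.eye(n_params)  # lam*I plus a running sum of gradient outer products

def grad_vec(x):
    # Gradient of the scalar network output w.r.t. all parameters, flattened.
    net.zero_grad()
    net(x).backward()
    return torch.cat([q.grad.flatten() for q in net.parameters()])

arms = torch.randn(5, d)       # candidate contexts for one round
scores = []
for x in arms:
    g = grad_vec(x)
    var = nu**2 * (g @ torch.linalg.solve(U, g))   # gradient-based variance
    scores.append((net(x).detach()
                   + var.clamp_min(1e-12).sqrt() * torch.randn(1)).item())
chosen = int(torch.tensor(scores).argmax())

# After observing the reward, one would fit the network on the squared loss
# and add the chosen arm's gradient outer product to U.
g = grad_vec(arms[chosen])
U += torch.outer(g, g)
```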
On information gain and regret bounds in Gaussian process bandits
Consider the sequential optimization of an expensive-to-evaluate and possibly non-convex
objective function $f$ from noisy feedback, which can be considered a continuum-armed …
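GP-based Thompson sampling is one of the algorithms such bounds cover; a minimal numpy sketch on a discretized domain follows, with the test function, kernel length-scale, and noise level all assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

grid = np.linspace(0.0, 1.0, 200)
f = np.sin(3 * grid) + 0.5 * np.cos(7 * grid)  # stand-in objective
noise = 0.1

def rbf(a, b, ls=0.1):
    # Squared-exponential kernel on 1-D inputs.
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls**2)

K_grid = rbf(grid, grid)
X, y = [], []

for t in range(30):
    if X:
        Xa, ya = np.array(X), np.array(y)
        K = rbf(Xa, Xa) + noise**2 * np.eye(len(Xa))
        k_star = rbf(grid, Xa)
        mu = k_star @ np.linalg.solve(K, ya)
        cov = K_grid - k_star @ np.linalg.solve(K, k_star.T)
    else:
        mu, cov = np.zeros_like(grid), K_grid
    # Thompson step: draw one function from the GP posterior, query its argmax.
    path = rng.multivariate_normal(mu, cov + 1e-6 * np.eye(len(grid)))
    i = int(np.argmax(path))
    X.append(grid[i])
    y.append(f[i] + noise * rng.standard_normal())
```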
Frequentist regret bounds for randomized least-squares value iteration
We consider the exploration-exploitation dilemma in finite-horizon reinforcement learning
(RL). When the state space is large or continuous, traditional tabular approaches are …
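In its simplest tabular, finite-horizon form, the idea can be sketched as follows: each episode, the regression targets in the backward value-iteration fit are perturbed with Gaussian noise, which randomizes the greedy policy the way posterior sampling would. The toy MDP, noise scale, and pseudo-count regularizer below are assumptions, not the paper's exact construction.

```python
import numpy as np

rng = np.random.default_rng(5)

S, A, H, sigma = 5, 2, 3, 1.0
P = rng.dirichlet(np.ones(S), size=(S, A))  # toy transition kernel
R = rng.random((S, A))                      # toy mean rewards
data = [[] for _ in range(H)]               # (s, a, r, s') tuples per step h

for episode in range(200):
    # Backward pass: fit a noise-perturbed Q for each step from the data so far.
    Q = np.zeros((H + 1, S, A))
    for h in range(H - 1, -1, -1):
        counts = np.ones((S, A))            # prior pseudo-count as regularizer
        totals = np.zeros((S, A))
        for s, a, r, s2 in data[h]:
            totals[s, a] += r + Q[h + 1, s2].max() + sigma * rng.standard_normal()
            counts[s, a] += 1
        Q[h] = totals / counts
    # Forward pass: act greedily w.r.t. the perturbed Q and log transitions.
    s = 0
    for h in range(H):
        a = int(np.argmax(Q[h, s]))
        s2 = rng.choice(S, p=P[s, a])
        r = R[s, a] + 0.1 * rng.standard_normal()
        data[h].append((s, a, r, s2))
        s = s2
```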
Efficient exploration through Bayesian deep Q-networks
We propose Bayesian Deep Q-Network (BDQN), a practical Thompson-sampling-based
reinforcement learning (RL) algorithm. Thompson sampling allows for targeted exploration …
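The mechanism underlying BDQN is Bayesian linear regression on the Q-network's last-layer features; a self-contained numpy sketch of that posterior-sampling step is below, with random stand-in features in place of a trained network (dimensions, noise variance, and prior variance are assumed).

```python
import numpy as np

rng = np.random.default_rng(2)

d = 16                        # feature dimension of the last hidden layer
sigma2, prior_var = 0.1, 1.0  # observation noise and prior variance (assumed)

# Stand-in features and regression targets, as would come from a replay buffer.
Phi = rng.standard_normal((500, d))
targets = Phi @ rng.standard_normal(d) + 0.3 * rng.standard_normal(500)

# Conjugate Gaussian posterior over the final linear weights: N(mean, cov).
prec = np.eye(d) / prior_var + Phi.T @ Phi / sigma2
cov = np.linalg.inv(prec)
mean = cov @ Phi.T @ targets / sigma2

# Thompson sampling step: draw one weight vector, act greedily under it.
w = rng.multivariate_normal(mean, cov)
phi_candidates = rng.standard_normal((4, d))  # features of 4 candidate actions
action = int(np.argmax(phi_candidates @ w))
```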
Learning to optimize under non-stationarity
We introduce algorithms that achieve state-of-the-art dynamic regret bounds for the
non-stationary linear stochastic bandit setting, which captures natural applications such as dynamic …
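The paper's algorithms are UCB-style; as a rough Thompson-sampling analog of the same sliding-window idea, the sketch below rebuilds the regularized least-squares posterior from only the last W observations while the true parameter drifts (W, the exploration scale v, and the drift rate are assumptions).

```python
import numpy as np
from collections import deque

rng = np.random.default_rng(3)

d, W, v, lam = 3, 100, 0.3, 1.0
theta = rng.standard_normal(d)
history = deque(maxlen=W)  # only the last W observations shape the posterior

for t in range(1000):
    theta += 0.01 * rng.standard_normal(d)  # non-stationary parameter drift
    arms = rng.standard_normal((5, d))
    # Rebuild the ridge-regression posterior from the window
    # (O(W d^2) per round, kept simple for clarity).
    A = lam * np.eye(d)
    b = np.zeros(d)
    for x, r in history:
        A += np.outer(x, x)
        b += r * x
    mu = np.linalg.solve(A, b)
    sample = rng.multivariate_normal(mu, v**2 * np.linalg.inv(A))
    i = int(np.argmax(arms @ sample))
    r = arms[i] @ theta + 0.1 * rng.standard_normal()
    history.append((arms[i], r))
```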
Bayesian decision-making under misspecified priors with applications to meta-learning
Thompson sampling and other Bayesian sequential decision-making algorithms are among
the most popular approaches to tackle explore/exploit trade-offs in (contextual) bandits. The …
Online (multinomial) logistic bandit: Improved regret and constant computation cost
This paper investigates the logistic bandit problem, a variant of the generalized linear bandit
model that uses a logistic function to describe the feedback from an action. While most existing …
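As a classical point of comparison for this setting (a baseline, not the paper's method), here is a Laplace-approximation Thompson sampling round for the logistic bandit; the Newton fit, regularizer lam, and toy problem are assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

d, lam = 3, 1.0
theta_true = np.array([0.5, -1.0, 1.5])  # hypothetical true parameter
X, y = [], []

def fit_map(X, y):
    # Newton iterations for the L2-regularized logistic MAP estimate;
    # returns the mode and the Hessian used as the Laplace precision.
    Xa, ya = np.array(X), np.array(y, dtype=float)
    w = np.zeros(d)
    for _ in range(25):
        p = sigmoid(Xa @ w)
        grad = Xa.T @ (p - ya) + lam * w
        H = (Xa * (p * (1 - p))[:, None]).T @ Xa + lam * np.eye(d)
        w -= np.linalg.solve(H, grad)
    return w, H

for t in range(200):
    arms = rng.standard_normal((5, d))
    if X:
        w_map, H = fit_map(X, y)
        w = rng.multivariate_normal(w_map, np.linalg.inv(H))  # Laplace sample
    else:
        w = rng.standard_normal(d)
    i = int(np.argmax(arms @ w))
    y.append(rng.random() < sigmoid(arms[i] @ theta_true))  # binary feedback
    X.append(arms[i])
```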
Meta dynamic pricing: Transfer learning across experiments
We study the problem of learning shared structure across a sequence of dynamic pricing
experiments for related products. We consider a practical formulation in which the unknown …