Stochastic multi-armed-bandit problem with non-stationary rewards

O Besbes, Y Gur, A Zeevi - Advances in Neural Information Processing Systems, 2014 - proceedings.neurips.cc
In a multi-armed bandit (MAB) problem, a gambler needs to choose at each round of play
one of K arms, each characterized by an unknown reward distribution. Reward realizations …
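
The approach this line of work builds on is to restart an adversarial-bandit algorithm such as Exp3 on a fixed schedule, so that weight estimates never mix rewards from distributions that have since drifted apart. A minimal sketch of that restarting idea, assuming rewards in [0, 1]; the names pull and batch_len are illustrative, and the paper tunes the batch length to its variation budget rather than leaving it free:

    import numpy as np

    def restarted_exp3(pull, K, T, batch_len, seed=0):
        """Exp3 restarted every `batch_len` rounds (sketch).

        pull(arm) -> reward in [0, 1]; K arms; horizon T.
        """
        rng = np.random.default_rng(seed)
        # Standard Exp3 exploration rate for a horizon of batch_len rounds.
        gamma = min(1.0, np.sqrt(K * np.log(K) / ((np.e - 1) * batch_len)))
        rewards = []
        for start in range(0, T, batch_len):
            w = np.ones(K)  # restart: discard estimates from earlier batches
            for _ in range(start, min(start + batch_len, T)):
                p = (1 - gamma) * w / w.sum() + gamma / K
                arm = rng.choice(K, p=p)
                r = pull(arm)
                rewards.append(r)
                # Importance-weighted update for the pulled arm only.
                w[arm] *= np.exp(gamma * (r / p[arm]) / K)
        return rewards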

Learning to optimize under non-stationarity

WC Cheung, D Simchi-Levi, R Zhu - The 22nd International Conference on Artificial Intelligence and Statistics (AISTATS), 2019 - proceedings.mlr.press
We introduce algorithms that achieve state-of-the-art dynamic regret bounds for the non-
stationary linear stochastic bandit setting. This setting captures natural applications such as dynamic …
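
The standard forgetting mechanism in this setting is a sliding window: the ridge-regression estimate is fit only to the most recent observations, so data generated under long-gone parameters is discarded. A sketch of one decision round under that idea (the confidence width beta and the interface are illustrative, not the paper's exact tuning):

    import numpy as np

    def sw_linucb_choose(history, x_arms, window, lam=1.0, beta=1.0):
        """One round of sliding-window LinUCB (sketch).

        history: list of (feature_vector, reward) pairs; only the last
        `window` pairs are used, so stale data is forgotten.
        x_arms: (K, d) array of candidate arm features for this round.
        """
        d = x_arms.shape[1]
        V = lam * np.eye(d)          # regularized design matrix
        b = np.zeros(d)
        for x, r in history[-window:]:
            V += np.outer(x, x)
            b += r * x
        V_inv = np.linalg.inv(V)
        theta = V_inv @ b            # windowed ridge-regression estimate
        # Optimistic index: estimated reward plus an exploration bonus.
        bonus = beta * np.sqrt(np.einsum('ki,ij,kj->k', x_arms, V_inv, x_arms))
        return int(np.argmax(x_arms @ theta + bonus))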

Reinforcement learning for non-stationary Markov decision processes: The blessing of (more) optimism

WC Cheung, D Simchi-Levi, R Zhu - International Conference on Machine Learning (ICML), 2020 - proceedings.mlr.press
We consider undiscounted reinforcement learning (RL) in Markov decision processes
(MDPs) under drifting non-stationarity, i.e., both the reward and state transition distributions …
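
In this drifting model, the non-stationarity is quantified by variation budgets on the rewards and the transition kernels; in the notation standard for this literature (the symbols below follow that convention, not the truncated snippet):

$$\sum_{t=1}^{T-1} \max_{s,a} \bigl| r_{t+1}(s,a) - r_t(s,a) \bigr| \le B_r, \qquad \sum_{t=1}^{T-1} \max_{s,a} \bigl\| p_{t+1}(\cdot \mid s,a) - p_t(\cdot \mid s,a) \bigr\|_1 \le B_p,$$

so the regret bounds scale with $B_r$ and $B_p$ rather than requiring the environment to be piecewise-stationary.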

A new algorithm for non-stationary contextual bandits: Efficient, optimal and parameter-free

Y Chen, CW Lee, H Luo, CY Wei - Conference on Learning Theory (COLT), 2019 - proceedings.mlr.press
We propose the first contextual bandit algorithm that is parameter-free, efficient, and optimal
in terms of dynamic regret. Specifically, our algorithm achieves $\mathcal{O}(\min\{\sqrt …
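
For reference, the bound truncated above is, in the published version, the minimax dynamic-regret rate for this problem (restated here from the paper rather than the snippet, so treat it as a reconstruction; $S$ is the number of distribution switches, $\Delta$ the total variation, and $T$ the horizon):

$$\widetilde{\mathcal{O}}\bigl(\min\{\sqrt{S T},\; \Delta^{1/3} T^{2/3}\}\bigr).$$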

Near-optimal model-free reinforcement learning in non-stationary episodic MDPs

W Mao, K Zhang, R Zhu, D Simchi-Levi, T Başar - International Conference on Machine Learning (ICML), 2021 - proceedings.mlr.press
We consider model-free reinforcement learning (RL) in non-stationary Markov decision
processes. Both the reward functions and the state transition functions are allowed to vary …
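
The workhorse mechanism in model-free approaches of this kind is periodic restarting: the learner discards its value estimates on a schedule matched to the drift, so no estimate ever mixes data from very different regimes. A sketch of that loop, without the confidence bonuses the actual algorithm adds on top (every interface name here, such as env_step, is illustrative):

    import numpy as np

    def restarted_q_learning(env_reset, env_step, S, A, H, episodes, epochs):
        """Tabular episodic Q-learning with periodic restarts (sketch).

        env_reset() -> initial state index; env_step(h, s, a) -> (reward,
        next_state) at step h. The Q-table is reset at every epoch boundary.
        """
        returns = []
        for _ in range(epochs):
            Q = np.full((H, S, A), float(H))   # optimistic initialization
            N = np.zeros((H, S, A))
            for _ in range(episodes // epochs):
                s, total = env_reset(), 0.0
                for h in range(H):
                    a = int(np.argmax(Q[h, s]))
                    r, s2 = env_step(h, s, a)
                    N[h, s, a] += 1
                    lr = (H + 1) / (H + N[h, s, a])  # standard step size
                    future = Q[h + 1, s2].max() if h + 1 < H else 0.0
                    Q[h, s, a] += lr * (r + future - Q[h, s, a])
                    total += r
                    s = s2
                returns.append(total)
        return returns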

Hedging the drift: Learning to optimize under nonstationarity

WC Cheung, D Simchi-Levi, R Zhu - Management Science, 2022 - pubsonline.informs.org
We introduce data-driven decision-making algorithms that achieve state-of-the-art dynamic
regret bounds for a collection of nonstationary stochastic bandit settings. These settings …
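
The distinctive ingredient in this journal version is a bandit-over-bandit (BOB) construction: a master adversarial bandit picks the forgetting parameter of the base algorithm (e.g., its window length) block by block, which removes the need to know the variation budget in advance. A sketch with Exp3 as the master (run_block and the [0, 1] reward normalization are assumptions):

    import numpy as np

    def bandit_over_bandit(run_block, windows, num_blocks, seed=0):
        """Master Exp3 that tunes a base algorithm's window length (sketch).

        run_block(w) -> total reward, normalized to [0, 1], of running the
        base sliding-window algorithm with window length w for one block.
        """
        rng = np.random.default_rng(seed)
        J = len(windows)
        gamma = min(1.0, np.sqrt(J * np.log(J) / ((np.e - 1) * num_blocks)))
        weights = np.ones(J)
        total = 0.0
        for _ in range(num_blocks):
            p = (1 - gamma) * weights / weights.sum() + gamma / J
            j = rng.choice(J, p=p)
            reward = run_block(windows[j])
            total += reward
            # Importance-weighted Exp3 update on the chosen window length.
            weights[j] *= np.exp(gamma * (reward / p[j]) / J)
        return total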

Efficient contextual bandits in non-stationary worlds

H Luo, CY Wei, A Agarwal, J Langford - Conference on Learning Theory (COLT), 2018 - proceedings.mlr.press
Most contextual bandit algorithms minimize regret against the best fixed policy, a
questionable benchmark for non-stationary environments that are ubiquitous in applications …

Dynamic regret of policy optimization in non-stationary environments

Y Fei, Z Yang, Z Wang, Q Xie - Advances in Neural Information Processing Systems, 2020 - proceedings.neurips.cc
We consider reinforcement learning (RL) in episodic MDPs with adversarial full-information
reward feedback and unknown fixed transition kernels. We propose two model-free policy …

Non-stationary experimental design under linear trends

D Simchi-Levi, C Wang… - Advances in Neural Information Processing Systems, 2023 - proceedings.neurips.cc
Experimentation has been critical and increasingly popular across various domains, such as
clinical trials and online platforms, due to its widely recognized benefits. One of the primary …

Non-stationary reinforcement learning under general function approximation

S Feng, M Yin, R Huang, YX Wang… - International Conference on Machine Learning (ICML), 2023 - proceedings.mlr.press
General function approximation is a powerful tool to handle large state and action spaces in
a broad range of reinforcement learning (RL) scenarios. However, theoretical understanding …