Feel-good Thompson sampling for contextual bandits and reinforcement learning

T Zhang - SIAM Journal on Mathematics of Data Science, 2022 - SIAM
Thompson sampling has been widely used for contextual bandit problems due to the
flexibility of its modeling power. However, a general theory for this class of methods in the …

Making RL with preference-based feedback efficient via randomization

R Wu, W Sun - arXiv preprint arXiv:2310.14554, 2023 - arxiv.org
Reinforcement Learning algorithms that learn from human feedback (RLHF) need to be
efficient in terms of statistical complexity, computational complexity, and query complexity. In …

A self-play posterior sampling algorithm for zero-sum Markov games

W Xiong, H Zhong, C Shi, C Shen… - … on Machine Learning, 2022 - proceedings.mlr.press
Existing studies on provably efficient algorithms for Markov games (MGs) almost exclusively
build on the “optimism in the face of uncertainty” (OFU) principle. This work focuses on a …

A provably efficient model-free posterior sampling method for episodic reinforcement learning

C Dann, M Mohri, T Zhang… - Advances in Neural …, 2021 - proceedings.neurips.cc
Thompson Sampling is one of the most effective methods for contextual bandits and has
been generalized to posterior sampling for certain MDP settings. However, existing posterior …

Posterior sampling with delayed feedback for reinforcement learning with linear function approximation

NL Kuang, M Yin, M Wang… - Advances in Neural …, 2023 - proceedings.neurips.cc
Recent studies in reinforcement learning (RL) have made significant progress by leveraging
function approximation to alleviate the sample complexity hurdle for better performance …

Randomized exploration in reinforcement learning with general value function approximation

H Ishfaq, Q Cui, V Nguyen, A Ayoub… - International …, 2021 - proceedings.mlr.press
We propose a model-free reinforcement learning algorithm inspired by the popular
randomized least squares value iteration (RLSVI) algorithm as well as the optimism …

Towards deployment-efficient reinforcement learning: Lower bound and optimality

J Huang, J Chen, L Zhao, T Qin, N Jiang… - arXiv preprint arXiv …, 2022 - arxiv.org
Deployment efficiency is an important criterion for many real-world applications of
reinforcement learning (RL). Despite the community's increasing interest, there lacks a …

Nonstationary reinforcement learning with linear function approximation

H Zhou, J Chen, LR Varshney, A Jagmohan - arXiv preprint arXiv …, 2020 - arxiv.org
We consider reinforcement learning (RL) in episodic Markov decision processes (MDPs)
with linear function approximation under drifting environment. Specifically, both the reward …

Optimistic Thompson sampling-based algorithms for episodic reinforcement learning

B Hu, TH Zhang, N Hegde… - Uncertainty in Artificial …, 2023 - proceedings.mlr.press
We propose two Thompson Sampling-like, model-based learning algorithms for
episodic Markov decision processes (MDPs) with a finite time horizon. Our proposed …

Dyadic reinforcement learning

S Li, LS Niell, SW Choi, I Nahum-Shani… - arXiv preprint arXiv …, 2023 - arxiv.org
Mobile health aims to enhance health outcomes by delivering interventions to individuals as
they go about their daily life. The involvement of care partners and social support networks …