UCB momentum Q-learning: Correcting the bias without forgetting
We propose UCBMQ, Upper Confidence Bound Momentum Q-learning, a new
algorithm for reinforcement learning in tabular and possibly stage-dependent, episodic …
Optimistic posterior sampling for reinforcement learning with few samples and tight guarantees
We consider reinforcement learning in an environment modeled by an episodic, tabular,
step-dependent Markov decision process of horizon $ H $ with $ S $ states, and $ A …
Near instance-optimal PAC reinforcement learning for deterministic MDPs
In probably approximately correct (PAC) reinforcement learning (RL), an agent is required to
identify an $\epsilon$-optimal policy with probability $1-\delta$. While minimax optimal …
Model-based uncertainty in value functions
We consider the problem of quantifying uncertainty over expected cumulative rewards in
model-based reinforcement learning. In particular, we focus on characterizing the variance …
Model-free posterior sampling via learning rate randomization
In this paper, we introduce Randomized Q-learning (RandQL), a novel randomized model-
free algorithm for regret minimization in episodic Markov Decision Processes (MDPs). To the …
Value-distributional model-based reinforcement learning
Quantifying uncertainty about a policy's long-term performance is important to solve
sequential decision-making tasks. We study the problem from a model-based Bayesian …
Online policy optimization for robust MDP
Reinforcement learning (RL) has exceeded human performance in many synthetic settings
such as video games and Go. However, real-world deployment of end-to-end RL models is …
Bandits corrupted by nature: Lower bounds on regret and robust optimistic algorithm
We study the corrupted bandit problem, i.e., a stochastic multi-armed bandit problem with $k$
unknown reward distributions, which are heavy-tailed and corrupted by a history …
Hybrid Transfer Reinforcement Learning: Provable Sample Efficiency from Shifted-Dynamics Data
Online reinforcement learning (RL) typically requires high-stakes online interaction data to
learn a policy for a target task. This prompts interest in leveraging historical data to improve …
Model-Based Epistemic Variance of Values for Risk-Aware Policy Optimization
We consider the problem of quantifying uncertainty over expected cumulative rewards in
model-based reinforcement learning. In particular, we focus on characterizing the variance …