On the statistical efficiency of reward-free exploration in non-linear RL

J Chen, A Modi, A Krishnamurthy… - Advances in Neural …, 2022 - proceedings.neurips.cc
We study reward-free reinforcement learning (RL) under general non-linear function
approximation, and establish sample efficiency and hardness results under various standard …

Future-dependent value-based off-policy evaluation in POMDPs

M Uehara, H Kiyohara, A Bennett… - Advances in Neural …, 2023 - proceedings.neurips.cc
We study off-policy evaluation (OPE) for partially observable MDPs (POMDPs) with general
function approximation. Existing methods such as sequential importance sampling …

A primal-dual-critic algorithm for offline constrained reinforcement learning

K Hong, Y Li, A Tewari - International Conference on …, 2024 - proceedings.mlr.press
Offline constrained reinforcement learning (RL) aims to learn a policy that maximizes the
expected cumulative reward subject to constraints on expected cumulative cost using an …

Neural network approximation for pessimistic offline reinforcement learning

D Wu, Y Jiao, L Shen, H Yang, X Lu - Proceedings of the AAAI …, 2024 - ojs.aaai.org
Deep reinforcement learning (RL) has shown remarkable success in specific offline
decision-making scenarios, yet its theoretical guarantees are still under development. Existing works …

Offline minimax soft-Q-learning under realizability and partial coverage

M Uehara, N Kallus, JD Lee… - Advances in Neural …, 2023 - proceedings.neurips.cc
We consider offline reinforcement learning (RL) where we only have access to offline
data. In contrast to numerous offline RL algorithms that necessitate the uniform coverage of …

OMPO: A unified framework for RL under policy and dynamics shifts

Y Luo, T Ji, F Sun, J Zhang, H Xu, X Zhan - arXiv preprint arXiv …, 2024 - arxiv.org
Training reinforcement learning policies using environment interaction data collected from
varying policies or dynamics presents a fundamental challenge. Existing works often …

A finite-sample analysis of multi-step temporal difference estimates

Y Duan, MJ Wainwright - Learning for Dynamics and Control …, 2023 - proceedings.mlr.press
We consider the problem of estimating the value function of an infinite-horizon
$\gamma$-discounted Markov reward process (MRP). We establish non-asymptotic guarantees for a …

Policy evaluation from a single path: Multi-step methods, mixing and mis-specification

Y Duan, MJ Wainwright - arXiv preprint arXiv:2211.03899, 2022 - arxiv.org
We study non-parametric estimation of the value function of an infinite-horizon
$\gamma$-discounted Markov reward process (MRP) using observations from a single trajectory. We …

Offline Learning for Combinatorial Multi-armed Bandits

X Liu, X Dai, J Zuo, S Wang, CJ Wong, J Lui… - arXiv preprint arXiv …, 2025 - arxiv.org
The combinatorial multi-armed bandit (CMAB) is a fundamental sequential decision-making
framework, extensively studied over the past decade. However, existing work primarily …

Reinforcement learning under general function approximation and novel interaction settings

J Chen - 2023 - ideals.illinois.edu
Reinforcement Learning (RL) is an area of machine learning where an intelligent agent
solves sequential decision-making problems based on experience. Recent advances in the …