A review of off-policy evaluation in reinforcement learning

M Uehara, C Shi, N Kallus - arXiv preprint arXiv:2212.06355, 2022 - arxiv.org
Reinforcement learning (RL) is one of the most vibrant research frontiers in machine
learning and has been recently applied to solve a number of challenging problems. In this …

Principled reinforcement learning with human feedback from pairwise or k-wise comparisons

B Zhu, M Jordan, J Jiao - International Conference on …, 2023 - proceedings.mlr.press
We provide a theoretical framework for Reinforcement Learning with Human Feedback
(RLHF). We show that when the underlying true reward is linear, under both Bradley-Terry …

Offline reinforcement learning as one big sequence modeling problem

M Janner, Q Li, S Levine - Advances in neural information …, 2021 - proceedings.neurips.cc
Reinforcement learning (RL) is typically viewed as the problem of estimating single-step
policies (for model-free RL) or single-step models (for model-based RL), leveraging the …

Supervised pretraining can learn in-context reinforcement learning

J Lee, A **e, A Pacchiano, Y Chandak… - Advances in …, 2024 - proceedings.neurips.cc
Large transformer models trained on diverse datasets have shown a remarkable ability to
learn in-context, achieving high few-shot performance on tasks they were not explicitly …

Cal-QL: Calibrated offline RL pre-training for efficient online fine-tuning

M Nakamoto, S Zhai, A Singh… - Advances in …, 2024 - proceedings.neurips.cc
A compelling use case of offline reinforcement learning (RL) is to obtain a policy initialization
from existing datasets followed by fast online fine-tuning with limited interaction. However …

COMBO: Conservative offline model-based policy optimization

T Yu, A Kumar, R Rafailov… - Advances in neural …, 2021 - proceedings.neurips.cc
Model-based reinforcement learning (RL) algorithms, which learn a dynamics
model from logged experience and perform conservative planning under the learned model …

Bellman-consistent pessimism for offline reinforcement learning

T **e, CA Cheng, N Jiang, P Mineiro… - Advances in neural …, 2021 - proceedings.neurips.cc
The use of pessimism when reasoning about datasets lacking exhaustive exploration has
recently gained prominence in offline reinforcement learning. Despite the robustness it adds …

Bridging offline reinforcement learning and imitation learning: A tale of pessimism

P Rashidinejad, B Zhu, C Ma, J Jiao… - Advances in Neural …, 2021 - proceedings.neurips.cc
Offline (or batch) reinforcement learning (RL) algorithms seek to learn an optimal policy from
a fixed dataset without active data collection. Based on the composition of the offline dataset …

Adversarially trained actor critic for offline reinforcement learning

CA Cheng, T **e, N Jiang… - … Conference on Machine …, 2022 - proceedings.mlr.press
We propose Adversarially Trained Actor Critic (ATAC), a new model-free algorithm
for offline reinforcement learning (RL) under insufficient data coverage, based on the …

Offline reinforcement learning with realizability and single-policy concentrability

W Zhan, B Huang, A Huang… - … on Learning Theory, 2022 - proceedings.mlr.press
Sample-efficiency guarantees for offline reinforcement learning (RL) often rely on strong
assumptions on both the function classes (e.g., Bellman-completeness) and the data …