Provably mitigating overoptimization in RLHF: Your SFT loss is implicitly an adversarial regularizer

Z Liu, M Lu, S Zhang, B Liu, H Guo, Y Yang… - arXiv preprint arXiv …, 2024 - arxiv.org
Aligning generative models with human preference via RLHF typically suffers from
overoptimization, where an imperfectly learned reward model can misguide the generative …

The curious price of distributional robustness in reinforcement learning with a generative model

L Shi, G Li, Y Wei, Y Chen… - Advances in Neural …, 2023 - proceedings.neurips.cc
This paper investigates model robustness in reinforcement learning (RL) via the framework
of distributionally robust Markov decision processes (RMDPs). Despite recent efforts, the …

Settling the sample complexity of model-based offline reinforcement learning

G Li, L Shi, Y Chen, Y Chi, Y Wei - The Annals of Statistics, 2024 - projecteuclid.org
The Annals of Statistics 2024, Vol. 52, No. 1, 233–260. https://doi.org/10.1214/23-AOS2342

Double pessimism is provably efficient for distributionally robust offline reinforcement learning: Generic algorithm and robust partial coverage

J Blanchet, M Lu, T Zhang… - Advances in Neural …, 2023 - proceedings.neurips.cc
We study distributionally robust offline reinforcement learning (RL), which seeks to find an
optimal robust policy purely from an offline dataset that can perform well in perturbed …

Distributionally robust model-based offline reinforcement learning with near-optimal sample complexity

L Shi, Y Chi - Journal of Machine Learning Research, 2024 - jmlr.org
This paper concerns the central issues of model robustness and sample efficiency in offline
reinforcement learning (RL), which aims to learn to perform decision making from history …

Reinforcement learning with human feedback: Learning dynamic choices via pessimism

Z Li, Z Yang, M Wang - arXiv preprint arXiv:2305.18438, 2023 - arxiv.org
In this paper, we study offline Reinforcement Learning with Human Feedback (RLHF) where
we aim to learn the human's underlying reward and the MDP's optimal policy from a set of …

The blessing of heterogeneity in federated Q-learning: Linear speedup and beyond

J Woo, G Joshi, Y Chi - International Conference on …, 2023 - proceedings.mlr.press
In this paper, we consider federated Q-learning, which aims to learn an optimal Q-function
by periodically aggregating local Q-estimates trained on local data alone. Focusing on …
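The snippet above only names the mechanism, so here is a minimal, purely illustrative sketch of the aggregation step it describes: agents run tabular Q-learning on their local data alone, and a server periodically averages their Q-estimates. The function names, hyperparameters, and toy data are assumptions for illustration, not code from the cited paper.

```python
import numpy as np

def local_q_update(Q, transitions, alpha=0.1, gamma=0.9):
    # One pass of tabular Q-learning over an agent's local transitions (s, a, r, s').
    for s, a, r, s_next in transitions:
        target = r + gamma * Q[s_next].max()
        Q[s, a] += alpha * (target - Q[s, a])
    return Q

def federated_q_learning(local_datasets, n_states, n_actions, rounds=10):
    # Each agent keeps its own Q-table; the server periodically averages them.
    Qs = [np.zeros((n_states, n_actions)) for _ in local_datasets]
    Q_avg = np.zeros((n_states, n_actions))
    for _ in range(rounds):
        # Local phase: each agent updates its Q-estimate on local data alone.
        Qs = [local_q_update(Q, data) for Q, data in zip(Qs, local_datasets)]
        # Aggregation phase: average the local estimates and broadcast.
        Q_avg = np.mean(Qs, axis=0)
        Qs = [Q_avg.copy() for _ in Qs]
    return Q_avg

# Illustrative usage: two agents with random transitions on a toy 4-state, 2-action MDP.
rng = np.random.default_rng(0)
datasets = [
    [(rng.integers(4), rng.integers(2), rng.random(), rng.integers(4)) for _ in range(50)]
    for _ in range(2)
]
print(federated_q_learning(datasets, n_states=4, n_actions=2))
```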

Adversarial model for offline reinforcement learning

M Bhardwaj, T Xie, B Boots, N Jiang… - Advances in Neural …, 2023 - proceedings.neurips.cc
We propose a novel model-based offline Reinforcement Learning (RL) framework, called
Adversarial Model for Offline Reinforcement Learning (ARMOR), which can robustly learn …

Breaking the sample size barrier in model-based reinforcement learning with a generative model

G Li, Y Wei, Y Chi, Y Gu… - Advances in Neural …, 2020 - proceedings.neurips.cc
We investigate the sample efficiency of reinforcement learning in a $\gamma$-discounted
infinite-horizon Markov decision process (MDP) with state space $S$ and action space $A$ …

Value-incentivized preference optimization: A unified approach to online and offline RLHF

S Cen, J Mei, K Goshvadi, H Dai, T Yang… - arXiv preprint arXiv …, 2024 - arxiv.org
Reinforcement learning from human feedback (RLHF) has demonstrated great promise in
aligning large language models (LLMs) with human preference. Depending on the …