Provably mitigating overoptimization in RLHF: Your SFT loss is implicitly an adversarial regularizer
Aligning generative models with human preference via RLHF typically suffers from
overoptimization, where an imperfectly learned reward model can misguide the generative …
The curious price of distributional robustness in reinforcement learning with a generative model
This paper investigates model robustness in reinforcement learning (RL) via the framework
of distributionally robust Markov decision processes (RMDPs). Despite recent efforts, the …
Settling the sample complexity of model-based offline reinforcement learning
The Annals of Statistics, 2024, Vol. 52, No. 1, 233–260. https://doi.org/10.1214/23-AOS2342
Double pessimism is provably efficient for distributionally robust offline reinforcement learning: Generic algorithm and robust partial coverage
We study distributionally robust offline reinforcement learning (RL), which seeks to find an
optimal robust policy purely from an offline dataset that can perform well in perturbed …
Distributionally robust model-based offline reinforcement learning with near-optimal sample complexity
This paper concerns the central issues of model robustness and sample efficiency in offline
reinforcement learning (RL), which aims to learn to perform decision making from history …
Reinforcement learning with human feedback: Learning dynamic choices via pessimism
In this paper, we study offline Reinforcement Learning with Human Feedback (RLHF) where
we aim to learn the human's underlying reward and the MDP's optimal policy from a set of …
The blessing of heterogeneity in federated Q-learning: Linear speedup and beyond
In this paper, we consider federated Q-learning, which aims to learn an optimal Q-function
by periodically aggregating local Q-estimates trained on local data alone. Focusing on …
Adversarial model for offline reinforcement learning
We propose a novel model-based offline Reinforcement Learning (RL) framework, called
Adversarial Model for Offline Reinforcement Learning (ARMOR), which can robustly learn …
Breaking the sample size barrier in model-based reinforcement learning with a generative model
We investigate the sample efficiency of reinforcement learning in a $\gamma$-discounted
infinite-horizon Markov decision process (MDP) with state space S and action space A …
Value-incentivized preference optimization: A unified approach to online and offline RLHF
Reinforcement learning from human feedback (RLHF) has demonstrated great promise in
aligning large language models (LLMs) with human preference. Depending on the …