Nearly minimax optimal reinforcement learning for linear Markov decision processes

J He, H Zhao, D Zhou, Q Gu - International Conference on …, 2023 - proceedings.mlr.press
We study reinforcement learning (RL) with linear function approximation. For episodic time-
inhomogeneous linear Markov decision processes (linear MDPs) whose transition …

A theoretical analysis of optimistic proximal policy optimization in linear Markov decision processes

H Zhong, T Zhang - Advances in Neural Information …, 2023 - proceedings.neurips.cc
The proximal policy optimization (PPO) algorithm stands as one of the most prosperous
methods in the field of reinforcement learning (RL). Despite its success, the theoretical …

VOQL: Towards Optimal Regret in Model-free RL with Nonlinear Function Approximation

A Agarwal, Y Jin, T Zhang - The Thirty Sixth Annual …, 2023 - proceedings.mlr.press
We study time-inhomogeneous episodic reinforcement learning (RL) under general function
approximation and sparse rewards. We design a new algorithm, Variance-weighted …

Corruption-robust offline reinforcement learning with general function approximation

C Ye, R Yang, Q Gu, T Zhang - Advances in Neural …, 2023 - proceedings.neurips.cc
We investigate the problem of corruption robustness in offline reinforcement learning (RL)
with general function approximation, where an adversary can corrupt each sample in the …

Variance-dependent regret bounds for linear bandits and reinforcement learning: Adaptivity and computational efficiency

H Zhao, J He, D Zhou, T Zhang… - The Thirty Sixth Annual …, 2023 - proceedings.mlr.press
Recently, several studies \citep{zhou2021nearly, zhang2021variance, kim2021improved,
zhou2022computationally} have provided variance-dependent regret bounds for linear …

Reinforcement learning from human feedback with active queries

K Ji, J He, Q Gu - arXiv preprint arXiv:2402.09401, 2024 - arxiv.org
Aligning large language models (LLM) with human preference plays a key role in building
modern generative models and can be achieved by reinforcement learning from human …

Noise-adaptive Thompson sampling for linear contextual bandits

R Xu, Y Min, T Wang - Advances in Neural Information …, 2023 - proceedings.neurips.cc
Linear contextual bandits represent a fundamental class of models with numerous real-
world applications, and it is critical to develop algorithms that can effectively manage noise …

Cooperative multi-agent reinforcement learning: asynchronous communication and linear function approximation

Y Min, J He, T Wang, Q Gu - International Conference on …, 2023 - proceedings.mlr.press
We study multi-agent reinforcement learning in the setting of episodic Markov decision
processes, where many agents cooperate via communication through a central server. We …

A nearly optimal and low-switching algorithm for reinforcement learning with general function approximation

H Zhao, J He, Q Gu - arXiv preprint arXiv:2311.15238, 2023 - arxiv.org
The exploration-exploitation dilemma has been a central challenge in reinforcement
learning (RL) with complex model classes. In this paper, we propose a new algorithm …

Tackling heavy-tailed rewards in reinforcement learning with function approximation: Minimax optimal and instance-dependent regret bounds

J Huang, H Zhong, L Wang… - Advances in Neural …, 2023 - proceedings.neurips.cc
While numerous works have focused on devising efficient algorithms for reinforcement
learning (RL) with uniformly bounded rewards, it remains an open question whether sample …