Nearly minimax optimal reinforcement learning for linear mixture markov decision processes

D Zhou, Q Gu, C Szepesvari - Conference on Learning …, 2021 - proceedings.mlr.press
We study reinforcement learning (RL) with linear function approximation where the
underlying transition probability kernel of the Markov decision process (MDP) is a linear …

Provably efficient exploration in policy optimization

Q Cai, Z Yang, C **, Z Wang - International Conference on …, 2020 - proceedings.mlr.press
While policy-based reinforcement learning (RL) achieves tremendous successes in practice,
it is significantly less understood in theory, especially compared with value-based RL. In …

Nearly minimax optimal reinforcement learning for linear markov decision processes

J He, H Zhao, D Zhou, Q Gu - International Conference on …, 2023 - proceedings.mlr.press
We study reinforcement learning (RL) with linear function approximation. For episodic time-
inhomogeneous linear Markov decision processes (linear MDPs) whose transition …

Unpacking reward sha**: Understanding the benefits of reward engineering on sample complexity

A Gupta, A Pacchiano, Y Zhai… - Advances in Neural …, 2022 - proceedings.neurips.cc
The success of reinforcement learning in a variety of challenging sequential decision-
making problems has been much discussed, but often ignored in this discussion is the …

Human-in-the-loop: Provably efficient preference-based reinforcement learning with general function approximation

X Chen, H Zhong, Z Yang, Z Wang… - … on Machine Learning, 2022 - proceedings.mlr.press
We study human-in-the-loop reinforcement learning (RL) with trajectory preferences, where
instead of receiving a numeric reward at each step, the RL agent only receives preferences …

Provably efficient safe exploration via primal-dual policy optimization

D Ding, X Wei, Z Yang, Z Wang… - … conference on artificial …, 2021 - proceedings.mlr.press
We study the safe reinforcement learning problem using the constrained Markov decision
processes in which an agent aims to maximize the expected total reward subject to a safety …

Leveraging offline data in online reinforcement learning

A Wagenmaker, A Pacchiano - International Conference on …, 2023 - proceedings.mlr.press
Two central paradigms have emerged in the reinforcement learning (RL) community: online
RL and offline RL. In the online RL setting, the agent has no prior knowledge of the …

Reinforcement learning with general value function approximation: Provably efficient approach via bounded eluder dimension

R Wang, RR Salakhutdinov… - Advances in Neural …, 2020 - proceedings.neurips.cc
Value function approximation has demonstrated phenomenal empirical success in
reinforcement learning (RL). Nevertheless, despite a handful of recent progress on …

A theoretical analysis of optimistic proximal policy optimization in linear markov decision processes

H Zhong, T Zhang - Advances in Neural Information …, 2023 - proceedings.neurips.cc
The proximal policy optimization (PPO) algorithm stands as one of the most prosperous
methods in the field of reinforcement learning (RL). Despite its success, the theoretical …

Reason for future, act for now: A principled framework for autonomous llm agents with provable sample efficiency

Z Liu, H Hu, S Zhang, H Guo, S Ke, B Liu… - arxiv preprint arxiv …, 2023 - arxiv.org
Large language models (LLMs) demonstrate impressive reasoning abilities, but translating
reasoning into actions in the real world remains challenging. In particular, it remains unclear …