Is RLHF more difficult than standard RL? A theoretical perspective

Y Wang, Q Liu, C Jin - Advances in Neural Information …, 2023 - proceedings.neurips.cc
Abstract: Reinforcement Learning from Human Feedback (RLHF) learns from preference
signals, while standard Reinforcement Learning (RL) directly learns from reward signals …
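The snippet contrasts preference signals with reward signals. A minimal sketch of the standard way such preferences are turned into a reward model, via maximum likelihood under the Bradley-Terry model (the linear reward model, synthetic data, and all names here are hypothetical illustrations, not taken from the paper):

```python
import numpy as np

# In RLHF the learner observes which of two trajectories a labeler prefers,
# not a scalar reward. Under the Bradley-Terry model, the probability that
# trajectory a is preferred over b is sigmoid(r(a) - r(b)).

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def preference_nll(theta, feats_a, feats_b, prefs):
    """Negative log-likelihood of observed preferences under a
    (hypothetical) linear reward model r(traj) = theta . feats(traj)."""
    diff = feats_a @ theta - feats_b @ theta  # r(a) - r(b) per pair
    p = sigmoid(diff)
    eps = 1e-12  # guard against log(0)
    return -np.mean(prefs * np.log(p + eps) + (1 - prefs) * np.log(1 - p + eps))

# Fit by plain gradient descent on synthetic pairs drawn from a true theta.
rng = np.random.default_rng(0)
true_theta = np.array([1.0, -2.0])
Xa = rng.normal(size=(500, 2))       # features of trajectory a per pair
Xb = rng.normal(size=(500, 2))       # features of trajectory b per pair
prefs = (rng.random(500) < sigmoid((Xa - Xb) @ true_theta)).astype(float)

theta = np.zeros(2)
for _ in range(2000):
    diff = (Xa - Xb) @ theta
    # gradient of the NLL above with respect to theta
    grad = -((prefs - sigmoid(diff))[:, None] * (Xa - Xb)).mean(axis=0)
    theta -= 0.1 * grad

# The fitted theta should recover the sign pattern of true_theta.
```

The learned reward can then be handed to any standard RL algorithm, which is exactly the reduction whose statistical cost the paper analyzes.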

Unified algorithms for RL with decision-estimation coefficients: No-regret, PAC, and reward-free learning

F Chen, S Mei, Y Bai - arXiv preprint arXiv:2209.11745, 2022 - arxiv.org
Finding unified complexity measures and algorithms for sample-efficient learning is a central
topic of research in reinforcement learning (RL). The Decision-Estimation Coefficient (DEC) …

Making RL with preference-based feedback efficient via randomization

R Wu, W Sun - arXiv preprint arXiv:2310.14554, 2023 - arxiv.org
Reinforcement Learning algorithms that learn from human feedback (RLHF) need to be
efficient in terms of statistical complexity, computational complexity, and query complexity. In …

Lower bounds for learning in revealing POMDPs

F Chen, H Wang, C Xiong, S Mei… - … Conference on Machine …, 2023 - proceedings.mlr.press
This paper studies the fundamental limits of reinforcement learning (RL) in the challenging
partially observable setting. While it is well-established that learning in Partially Observable …

On the complexity of multi-agent decision making: From learning in games to partial monitoring

D Foster, DJ Foster, N Golowich… - The Thirty Sixth …, 2023 - proceedings.mlr.press
A central problem in the theory of multi-agent reinforcement learning (MARL) is to
understand what structural conditions and algorithmic principles lead to sample-efficient …

Posterior sampling for competitive RL: function approximation and partial observation

S Qiu, Z Dai, H Zhong, Z Wang… - Advances in Neural …, 2024 - proceedings.neurips.cc
This paper investigates posterior sampling algorithms for competitive reinforcement learning
(RL) in the context of general function approximations. Focusing on zero-sum Markov games …

Partially observable RL with B-stability: Unified structural condition and sharp sample-efficient algorithms

F Chen, Y Bai, S Mei - arXiv preprint arXiv:2209.14990, 2022 - arxiv.org
Partial observability, where agents can only observe partial information about the true
underlying state of the system, is ubiquitous in real-world applications of Reinforcement …

Optimistic policy gradient in multi-player Markov games with a single controller: Convergence beyond the Minty property

I Anagnostides, I Panageas, G Farina… - Proceedings of the AAAI …, 2024 - ojs.aaai.org
Policy gradient methods enjoy strong practical performance in numerous tasks in
reinforcement learning. Their theoretical understanding in multiagent settings, however …

Provably efficient UCB-type algorithms for learning predictive state representations

R Huang, Y Liang, J Yang - arXiv preprint arXiv:2307.00405, 2023 - arxiv.org
The general sequential decision-making problem, which includes Markov decision
processes (MDPs) and partially observable MDPs (POMDPs) as special cases, aims at …
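The paper studies UCB-type algorithms for predictive state representations; the optimism principle they rely on is easiest to see in the simpler bandit setting. A minimal sketch of the classic UCB1 index on a multi-armed bandit (the arm means and horizon are hypothetical, and this is not the PSR algorithm of the paper):

```python
import numpy as np

# UCB-type methods act optimistically: each arm's index is its empirical
# mean plus a confidence bonus that shrinks as the arm is pulled more.

def ucb1(means, horizon, seed=0):
    """Run UCB1 on a Bernoulli bandit with the given arm means."""
    rng = np.random.default_rng(seed)
    k = len(means)
    counts = np.zeros(k)   # pulls per arm
    sums = np.zeros(k)     # cumulative reward per arm
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1    # pull each arm once to initialize
        else:
            # optimistic index: empirical mean + confidence bonus
            bonus = np.sqrt(2.0 * np.log(t) / counts)
            arm = int(np.argmax(sums / counts + bonus))
        reward = float(rng.random() < means[arm])  # Bernoulli draw
        counts[arm] += 1
        sums[arm] += reward
    return counts

counts = ucb1([0.2, 0.5, 0.8], horizon=5000)
# The best arm (mean 0.8) should receive the vast majority of pulls.
```

The PSR setting replaces the per-arm confidence bonus with a bonus over an estimated low-rank dynamics model, but the "explore what is uncertain, exploit what looks best" logic is the same.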

Less is more: Robust robot learning via partially observable multi-agent reinforcement learning

W Zhao, EA Rantala, J Pajarinen… - arXiv preprint arXiv …, 2023 - arxiv.org
In many multi-agent and high-dimensional robotic tasks, the controller can be designed in
either a centralized or decentralized way. Correspondingly, it is possible to use either single …