Provably efficient exploration in policy optimization
While policy-based reinforcement learning (RL) achieves tremendous successes in practice,
it is significantly less understood in theory, especially compared with value-based RL. In …
[BOOK][B] Algorithms for reinforcement learning
C Szepesvári - 2022 - books.google.com
Reinforcement learning is a learning paradigm concerned with learning to control a system
so as to maximize a numerical performance measure that expresses a long-term objective …
Non-stationary reinforcement learning without prior knowledge: An optimal black-box approach
We propose a black-box reduction that turns a certain reinforcement learning algorithm with
optimal regret in a (near-) stationary environment into another algorithm with optimal …
A unified view of entropy-regularized Markov decision processes
We propose a general framework for entropy-regularized average-reward reinforcement
learning in Markov decision processes (MDPs). Our approach is based on extending the …
Corruption-robust offline reinforcement learning with general function approximation
We investigate the problem of corruption robustness in offline reinforcement learning (RL)
with general function approximation, where an adversary can corrupt each sample in the …
Learning adversarial Markov decision processes with bandit feedback and unknown transition
We consider the task of learning in episodic finite-horizon Markov decision processes with
an unknown transition function, bandit feedback, and adversarial losses. We propose an …
Online convex optimization in adversarial Markov decision processes
We consider online learning in episodic loop-free Markov decision processes (MDPs),
where the loss function can change arbitrarily between episodes, and the transition function …
Corruption-robust algorithms with uncertainty weighting for nonlinear contextual bandits and Markov decision processes
Despite the significant interest and progress in reinforcement learning (RL) problems with
adversarial corruption, current works are either confined to the linear setting or lead to an …
Combinatorial pure exploration of multi-armed bandits
We study the combinatorial pure exploration (CPE) problem in the stochastic multi-armed
bandit setting, where a learner explores a set of arms with the objective of identifying …
A model selection approach for corruption robust reinforcement learning
We develop a model selection approach to tackle reinforcement learning with adversarial
corruption in both transition and reward. For finite-horizon tabular MDPs, without prior …