Improved algorithm for adversarial linear mixture MDPs with bandit feedback and unknown transition
We study reinforcement learning with linear function approximation, unknown transition, and
adversarial losses in the bandit feedback setting. Specifically, we focus on linear mixture …
Dynamic regret of adversarial MDPs with unknown transition and linear function approximation
We study reinforcement learning (RL) in episodic MDPs with adversarial full-information
losses and unknown transition. Instead of the classical static regret, we adopt \emph …
Nearly Optimal Sample Complexity of Offline KL-Regularized Contextual Bandits under Single-Policy Concentrability
KL-regularized policy optimization has become a workhorse in learning-based decision
making, while its theoretical understanding is still very limited. Although recent progress has …
Near-Optimal Dynamic Regret for Adversarial Linear Mixture MDPs
We study episodic linear mixture MDPs with unknown transition and adversarial rewards
under full-information feedback, employing dynamic regret as the performance measure. We …