Improved algorithm for adversarial linear mixture MDPs with bandit feedback and unknown transition

LF Li, P Zhao, ZH Zhou - International Conference on …, 2024 - proceedings.mlr.press
We study reinforcement learning with linear function approximation, unknown transition, and
adversarial losses in the bandit feedback setting. Specifically, we focus on linear mixture …

Dynamic regret of adversarial MDPs with unknown transition and linear function approximation

LF Li, P Zhao, ZH Zhou - Proceedings of the AAAI Conference on …, 2024 - ojs.aaai.org
We study reinforcement learning (RL) in episodic MDPs with adversarial full-information
losses and the unknown transition. Instead of the classical static regret, we adopt …

Nearly Optimal Sample Complexity of Offline KL-Regularized Contextual Bandits under Single-Policy Concentrability

Q Zhao, K Ji, H Zhao, T Zhang, Q Gu - arXiv preprint arXiv:2502.06051, 2025 - arxiv.org
KL-regularized policy optimization has become a workhorse in learning-based decision
making, while its theoretical understanding is still very limited. Although recent progress has …

Near-Optimal Dynamic Regret for Adversarial Linear Mixture MDPs

LF Li, P Zhao, ZH Zhou - arXiv preprint arXiv:2411.03107, 2024 - arxiv.org
We study episodic linear mixture MDPs with the unknown transition and adversarial rewards
under full-information feedback, employing dynamic regret as the performance measure. We …