A review of off-policy evaluation in reinforcement learning

M Uehara, C Shi, N Kallus - arXiv preprint arXiv:2212.06355, 2022 - arxiv.org
Reinforcement learning (RL) is one of the most vibrant research frontiers in machine
learning and has been recently applied to solve a number of challenging problems. In this …
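As an illustration of the basic estimator this survey covers, the following is a minimal sketch of trajectory-wise importance-sampling off-policy evaluation, assuming logged trajectories that record the behaviour policy's action probabilities; the function and argument names are illustrative, not taken from the survey.

```python
import numpy as np

def is_ope(trajectories, target_policy, gamma=0.99):
    """Trajectory-wise importance-sampling estimate of a target policy's value.

    Each trajectory is a list of (state, action, reward, behaviour_prob) tuples,
    where behaviour_prob is the probability the logging policy assigned to the
    logged action. target_policy(state, action) returns the target policy's
    probability of that action.
    """
    estimates = []
    for traj in trajectories:
        weight, ret = 1.0, 0.0
        for t, (s, a, r, b_prob) in enumerate(traj):
            weight *= target_policy(s, a) / b_prob  # cumulative importance ratio
            ret += (gamma ** t) * r                 # discounted return of the trajectory
        estimates.append(weight * ret)
    return float(np.mean(estimates))
```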

Counterfactual learning and evaluation for recommender systems: Foundations, implementations, and recent advances

Y Saito, T Joachims - Proceedings of the 15th ACM Conference on …, 2021 - dl.acm.org
Counterfactual estimators enable the use of existing log data to estimate how some new
target recommendation policy would have performed, if it had been used instead of the …
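A minimal sketch of the inverse propensity scoring (IPS) estimator that this line of counterfactual evaluation builds on, assuming logged bandit feedback of the form (context, action, reward, logging propensity); the self-normalised variant (SNIPS) is included as a common variance-reduction choice. All names here are illustrative.

```python
import numpy as np

def ips_estimate(logs, target_prob, self_normalise=False):
    """Estimate a new recommendation policy's expected reward from logged data.

    logs: iterable of (context, action, reward, logging_propensity) tuples.
    target_prob(context, action): probability the target policy shows `action`.
    """
    weights, rewards = [], []
    for x, a, r, p_log in logs:
        weights.append(target_prob(x, a) / p_log)  # importance weight
        rewards.append(r)
    weights, rewards = np.asarray(weights), np.asarray(rewards)
    if self_normalise:
        return float(np.sum(weights * rewards) / np.sum(weights))  # SNIPS
    return float(np.mean(weights * rewards))                       # vanilla IPS
```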

Policy gradient method for robust reinforcement learning

Y Wang, S Zou - International conference on machine …, 2022 - proceedings.mlr.press
This paper develops the first policy gradient method with a global optimality guarantee and
complexity analysis for robust reinforcement learning under model mismatch. Robust …

Improved sample complexity bounds for distributionally robust reinforcement learning

Z Xu, K Panaganti, D Kalathil - International Conference on …, 2023 - proceedings.mlr.press
We consider the problem of learning a control policy that is robust against the parameter
mismatches between the training environment and testing environment. We formulate this as …

Distributionally Robust Q-Learning

Z Liu, Q Bai, J Blanchet, P Dong, W Xu… - International …, 2022 - proceedings.mlr.press
Reinforcement learning (RL) has demonstrated remarkable achievements in simulated
environments. However, carrying this success to real environments requires the important …
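One way to picture distributionally robust Q-learning is a tabular robust Bellman update computed against an uncertainty set around the nominal transition kernel. The sketch below uses a simple R-contamination set of radius delta, which admits a closed-form worst case; this is a simplifying assumption made for illustration, not necessarily the uncertainty set used in the paper.

```python
import numpy as np

def robust_q_iteration(P, R, gamma=0.95, delta=0.1, iters=500):
    """Tabular robust Q-iteration under an R-contamination uncertainty set.

    P: nominal transition tensor of shape (S, A, S); R: rewards of shape (S, A).
    The adversary may replace a delta-fraction of the nominal kernel, so the
    worst case mixes the nominal expectation with the worst next-state value.
    """
    S, A = R.shape
    Q = np.zeros((S, A))
    for _ in range(iters):
        V = Q.max(axis=1)         # greedy state values
        nominal = P @ V           # E_{s' ~ P(.|s,a)}[V(s')], shape (S, A)
        worst = V.min()           # adversary pushes its mass to the worst state
        Q = R + gamma * ((1.0 - delta) * nominal + delta * worst)
    return Q
```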

Finite-sample regret bound for distributionally robust offline tabular reinforcement learning

Z Zhou, Z Zhou, Q Bai, L Qiu… - International …, 2021 - proceedings.mlr.press
While reinforcement learning has recently witnessed tremendous success in a wide range of
domains, robustness, or the lack thereof, remains an important issue that …

Doubly robust distributionally robust off-policy evaluation and learning

N Kallus, X Mao, K Wang… - … Conference on Machine …, 2022 - proceedings.mlr.press
Off-policy evaluation and learning (OPE/L) use offline observational data to make better
decisions, which is crucial in applications where online experimentation is limited. However …
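The doubly robust building block combines a reward model with importance weighting, so the estimate remains consistent if either component is accurate. Below is a sketch of the standard doubly robust estimator for logged bandit feedback; the paper's distributionally robust extension, which additionally guards against distribution shift, is not reproduced here, and all names are illustrative.

```python
import numpy as np

def doubly_robust_estimate(logs, target_prob, reward_model, actions):
    """Standard doubly robust value estimate from logged bandit feedback.

    logs: iterable of (context, action, reward, logging_propensity) tuples.
    target_prob(x, a): target policy's probability of action a in context x.
    reward_model(x, a): estimated expected reward for (x, a).
    actions: list of all candidate actions.
    """
    values = []
    for x, a, r, p_log in logs:
        # Model-based term: expected reward of the target policy under the model.
        direct = sum(target_prob(x, ap) * reward_model(x, ap) for ap in actions)
        # Importance-weighted correction on the logged action.
        correction = target_prob(x, a) / p_log * (r - reward_model(x, a))
        values.append(direct + correction)
    return float(np.mean(values))
```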

Toward theoretical understandings of robust markov decision processes: Sample complexity and asymptotics

W Yang, L Zhang, Z Zhang - The Annals of Statistics, Vol. 50, No. 6, 3223–3248, 2022 - projecteuclid.org

Pessimistic reward models for off-policy learning in recommendation

O Jeunen, B Goethals - Proceedings of the 15th ACM Conference on …, 2021 - dl.acm.org
Methods for bandit learning from user interactions often require a model of the reward a
certain context-action pair will yield–for example, the probability of a click on a …
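The pessimism idea amounts to acting on a lower confidence bound of the modelled reward rather than its point estimate. As a hedged illustration, the sketch below scores items by a lower quantile of a Beta posterior over their click probability; the Beta(1, 1) prior and quantile level are assumptions made for this example, not the paper's exact construction.

```python
from scipy.stats import beta

def pessimistic_ranking(click_counts, impression_counts, quantile=0.05):
    """Rank items by a lower confidence bound on their click probability.

    click_counts / impression_counts: dicts mapping item id -> observed counts.
    A Beta(1, 1) prior is updated with the counts; items are scored by the
    `quantile` lower tail of the posterior instead of the posterior mean.
    """
    scores = {}
    for item, n in impression_counts.items():
        clicks = click_counts.get(item, 0)
        # Posterior lower quantile of the click-through rate under a uniform prior.
        scores[item] = beta.ppf(quantile, 1 + clicks, 1 + n - clicks)
    return sorted(scores, key=scores.get, reverse=True)
```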

Pessimistic decision-making for recommender systems

O Jeunen, B Goethals - ACM Transactions on Recommender Systems, 2023 - dl.acm.org
Modern recommender systems are often modelled under the sequential decision-making
paradigm, where the system decides which recommendations to show in order to maximise …