A review of off-policy evaluation in reinforcement learning
Reinforcement learning (RL) is one of the most vibrant research frontiers in machine
learning and has been recently applied to solve a number of challenging problems. In this …
Counterfactual learning and evaluation for recommender systems: Foundations, implementations, and recent advances
Counterfactual estimators enable the use of existing log data to estimate how some new
target recommendation policy would have performed, if it had been used instead of the …
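The counterfactual estimators this entry refers to are typically built on inverse propensity scoring (IPS), which reweights logged rewards by how much more (or less) likely the target policy is to take the logged action than the logging policy was. A minimal sketch, assuming clipped IPS for logged bandit feedback; the function name and the toy data are illustrative, not from the cited work:

```python
import numpy as np

def ips_estimate(rewards, logging_probs, target_probs, clip=10.0):
    """Clipped inverse-propensity-scoring estimate of a target policy's
    value from logged bandit feedback.

    rewards[i]        -- observed reward for the logged action
    logging_probs[i]  -- probability the logging policy chose that action
    target_probs[i]   -- probability the target policy would choose it
    """
    # Importance weights, clipped to limit the variance of rare actions.
    w = np.minimum(np.asarray(target_probs) / np.asarray(logging_probs), clip)
    return float(np.mean(w * np.asarray(rewards)))

# Three logged interactions under a uniform logging policy; the target
# policy concentrates on the actions that were clicked.
print(ips_estimate(rewards=[1.0, 0.0, 1.0],
                   logging_probs=[0.5, 0.5, 0.5],
                   target_probs=[0.9, 0.1, 0.9]))
```

The clip threshold trades bias for variance: without it, a single logged action that the logging policy almost never takes can dominate the estimate.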
Policy gradient method for robust reinforcement learning
This paper develops the first policy gradient method with global optimality guarantee and
complexity analysis for robust reinforcement learning under model mismatch. Robust …
Improved sample complexity bounds for distributionally robust reinforcement learning
We consider the problem of learning a control policy that is robust against the parameter
mismatches between the training environment and testing environment. We formulate this as …
Distributionally Robust Q-Learning
Reinforcement learning (RL) has demonstrated remarkable achievements in simulated
environments. However, carrying this success to real environments requires the important …
Finite-sample regret bound for distributionally robust offline tabular reinforcement learning
While reinforcement learning has witnessed tremendous success recently in a wide range of
domains, robustness, or the lack thereof, remains an important issue that remains …
Doubly robust distributionally robust off-policy evaluation and learning
Off-policy evaluation and learning (OPE/L) use offline observational data to make better
decisions, which is crucial in applications where online experimentation is limited. However …
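The doubly robust estimator this entry builds on combines a learned reward model (the direct method) with an importance-weighted correction on its residuals, and stays consistent if either the reward model or the propensities are accurate. A minimal sketch of the standard non-distributionally-robust DR estimator, with illustrative inputs that are not from the cited paper:

```python
import numpy as np

def doubly_robust_estimate(rewards, logging_probs, target_probs,
                           q_hat_logged, v_hat_target):
    """Doubly robust off-policy value estimate:
    DR = mean( v_hat_target + w * (reward - q_hat_logged) ),
    where w is the importance weight of the logged action.

    q_hat_logged[i] -- model's predicted reward for the logged action
    v_hat_target[i] -- model's predicted value of the target policy
                       in context i (expectation over its actions)
    """
    w = np.asarray(target_probs) / np.asarray(logging_probs)
    # IPS correction applied only to the model's residual error.
    correction = w * (np.asarray(rewards) - np.asarray(q_hat_logged))
    return float(np.mean(np.asarray(v_hat_target) + correction))
```

If the reward model is exact, every residual is zero and the estimate reduces to the direct method; if the propensities are exact, the correction removes the model's bias on average.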
Toward theoretical understandings of robust markov decision processes: Sample complexity and asymptotics
The Annals of Statistics 2022, Vol. 50, No. 6, 3223–3248 …
Pessimistic reward models for off-policy learning in recommendation
Methods for bandit learning from user interactions often require a model of the reward a
certain context-action pair will yield, for example, the probability of a click on a …
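Pessimism in this setting usually means scoring each context-action pair by a lower confidence bound on its predicted reward rather than the point estimate, so that rarely shown actions with noisy click rates are not over-valued. A minimal sketch using a normal-approximation lower bound on an empirical click probability; the function and the numbers below are illustrative assumptions, not the cited papers' models:

```python
import math

def pessimistic_reward(clicks, impressions, alpha=1.0):
    """Lower-confidence-bound estimate of a click probability.

    Subtracts an uncertainty penalty that shrinks with the number of
    impressions, so sparsely observed actions are scored cautiously.
    alpha controls how pessimistic the bound is.
    """
    p_hat = clicks / impressions
    penalty = alpha * math.sqrt(p_hat * (1.0 - p_hat) / impressions + 1e-12)
    return max(0.0, p_hat - penalty)

# An item with a 50% rate over 10 impressions scores BELOW an item with
# a 40% rate over 1000 impressions: the noisy estimate is penalised.
print(pessimistic_reward(5, 10), pessimistic_reward(400, 1000))
```

Ranking by this pessimistic score instead of the raw empirical rate is what shifts recommendations away from actions whose apparent quality is mostly estimation noise.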
Pessimistic decision-making for recommender systems
Modern recommender systems are often modelled under the sequential decision-making
paradigm, where the system decides which recommendations to show in order to maximise …