Recent advances in reinforcement learning in finance

B Hambly, R Xu, H Yang - Mathematical Finance, 2023 - Wiley Online Library
The rapid changes in the finance industry due to the increasing amount of data have
revolutionized the techniques on data processing and data analysis and brought new …

Recent developments in machine learning methods for stochastic control and games

R Hu, M Lauriere - arXiv preprint arXiv:2303.10257, 2023 - arxiv.org
Stochastic optimal control and games have a wide range of applications, from finance and
economics to social sciences, robotics, and energy management. Many real-world …

On the convergence rates of policy gradient methods

L Xiao - Journal of Machine Learning Research, 2022 - jmlr.org
We consider infinite-horizon discounted Markov decision problems with finite state and
action spaces and study the convergence rates of the projected policy gradient method and …

Natural policy gradient primal-dual method for constrained Markov decision processes

D Ding, K Zhang, T Basar… - Advances in Neural …, 2020 - proceedings.neurips.cc
We study sequential decision-making problems in which each agent aims to maximize the
expected total reward while satisfying a constraint on the expected total utility. We employ …

Policy gradient method for robust reinforcement learning

Y Wang, S Zou - International Conference on Machine …, 2022 - proceedings.mlr.press
This paper develops the first policy gradient method with global optimality guarantee and
complexity analysis for robust reinforcement learning under model mismatch. Robust …

Online robust reinforcement learning with model uncertainty

Y Wang, S Zou - Advances in Neural Information Processing …, 2021 - proceedings.neurips.cc
Robust reinforcement learning (RL) is to find a policy that optimizes the worst-case
performance over an uncertainty set of MDPs. In this paper, we focus on model-free robust …

DPO meets PPO: Reinforced token optimization for RLHF

H Zhong, G Feng, W Xiong, X Cheng, L Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
In the classical Reinforcement Learning from Human Feedback (RLHF) framework, Proximal
Policy Optimization (PPO) is employed to learn from sparse, sentence-level rewards--a …

Policy mirror descent for reinforcement learning: Linear convergence, new sampling complexity, and generalized problem classes

G Lan - Mathematical Programming, 2023 - Springer
We present new policy mirror descent (PMD) methods for solving reinforcement learning
(RL) problems with either strongly convex or general convex regularizers. By exploring the …

A finite-time analysis of two time-scale actor-critic methods

YF Wu, W Zhang, P Xu, Q Gu - Advances in Neural …, 2020 - proceedings.neurips.cc
Actor-critic (AC) methods have exhibited great empirical success compared with other
reinforcement learning algorithms, where the actor uses the policy gradient to improve the …

CRPO: A new approach for safe reinforcement learning with convergence guarantee

T Xu, Y Liang, G Lan - International Conference on Machine …, 2021 - proceedings.mlr.press
In safe reinforcement learning (SRL) problems, an agent explores the environment to
maximize an expected total reward and meanwhile avoids violation of certain constraints on …