A definition of continual reinforcement learning

D Abel, A Barreto, B Van Roy… - Advances in …, 2023 - proceedings.neurips.cc
In a standard view of the reinforcement learning problem, an agent's goal is to efficiently
identify a policy that maximizes long-term reward. However, this perspective is based on a …

Distributionally Robust Q-Learning

Z Liu, Q Bai, J Blanchet, P Dong, W Xu… - International …, 2022 - proceedings.mlr.press
Reinforcement learning (RL) has demonstrated remarkable achievements in simulated
environments. However, carrying this success to real environments requires the important …

Fine-tuning language models with advantage-induced policy alignment

B Zhu, H Sharma, FV Frujeri, S Dong, C Zhu… - arXiv preprint arXiv …, 2023 - arxiv.org
Reinforcement learning from human feedback (RLHF) has emerged as a reliable approach
to aligning large language models (LLMs) to human preferences. Among the plethora of …

Settling the reward hypothesis

M Bowling, JD Martin, D Abel… - … on Machine Learning, 2023 - proceedings.mlr.press
The reward hypothesis posits that, "all of what we mean by goals and purposes can be well
thought of as maximization of the expected value of the cumulative sum of a received scalar …

Reinforcement learning: An overview

K Murphy - arXiv preprint arXiv:2412.05265, 2024 - arxiv.org
This manuscript gives a big-picture, up-to-date overview of the field of (deep) reinforcement
learning and sequential decision making, covering value-based RL, policy-gradient …

Continual learning as computationally constrained reinforcement learning

S Kumar, H Marklund, A Rao, Y Zhu, HJ Jeon… - arXiv preprint arXiv …, 2023 - arxiv.org
An agent that efficiently accumulates knowledge to develop increasingly sophisticated skills
over a long lifetime could advance the frontier of artificial intelligence capabilities. The …

Partially Observed Optimal Stochastic Control: Regularity, Optimality, Approximations, and Learning

AD Kara, S Yuksel - arXiv preprint arXiv:2412.06735, 2024 - arxiv.org
In this review/tutorial article, we present recent progress on optimal control of partially
observed Markov Decision Processes (POMDPs). We first present regularity and continuity …

Deciding what to model: Value-equivalent sampling for reinforcement learning

D Arumugam, B Van Roy - Advances in neural information …, 2022 - proceedings.neurips.cc
The quintessential model-based reinforcement-learning agent iteratively refines its
estimates or prior beliefs about the true underlying model of the environment. Recent …

Three dogmas of reinforcement learning

D Abel, MK Ho, A Harutyunyan - arXiv preprint arXiv:2407.10583, 2024 - arxiv.org
Modern reinforcement learning has been conditioned by at least three dogmas. The first is
the environment spotlight, which refers to our tendency to focus on modeling environments …

Satisficing exploration for deep reinforcement learning

D Arumugam, S Kumar, R Gummadi… - arXiv preprint arXiv …, 2024 - arxiv.org
A default assumption in the design of reinforcement-learning algorithms is that a
decision-making agent always explores to learn optimal behavior. In sufficiently complex …