Efficient diffusion policies for offline reinforcement learning
Offline reinforcement learning (RL) aims to learn optimal policies from offline datasets,
where the parameterization of policies is crucial but often overlooked. Recently, Diffusion-QL …
Understanding, predicting and better resolving Q-value divergence in offline-RL
The divergence of the Q-value estimation has been a prominent issue in offline reinforcement
learning (offline RL), where the agent has no access to real dynamics. Traditional beliefs …
Preferred-Action-Optimized Diffusion Policies for Offline Reinforcement Learning
Offline reinforcement learning (RL) aims to learn optimal policies from previously collected
datasets. Recently, due to their powerful representational capabilities, diffusion models have …
Exclusively Penalized Q-learning for Offline Reinforcement Learning
J Yeom, Y Jo, J Kim, S Lee, S Han - arXiv preprint arXiv:2405.14082, 2024 - arxiv.org
Constraint-based offline reinforcement learning (RL) involves policy constraints or imposing
penalties on the value function to mitigate overestimation errors caused by distributional …
Out-of-Distribution Adaptation in Offline RL: Counterfactual Reasoning via Causal Normalizing Flows
Despite notable successes of Reinforcement Learning (RL), the prevalent use of an online
learning paradigm prevents its widespread adoption, especially in hazardous or costly …
UDQL: Bridging The Gap between MSE Loss and The Optimal Value Function in Offline Reinforcement Learning
Y Zhang, R Yu, Z Yao, W Zhang, J Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
The Mean Square Error (MSE) is commonly utilized to estimate the solution of the optimal
value function in the vast majority of offline reinforcement learning (RL) models and has …
Many of Your DPOs are Secretly One: Attempting Unification Through Mutual Information
R Tutnov, A Grosnit, H Bou-Ammar - arXiv preprint arXiv:2501.01544, 2025 - arxiv.org
Post-alignment of large language models (LLMs) is critical in improving their utility, safety,
and alignment with human intentions. Direct preference optimisation (DPO) has become one …
A Collaborative Perspective on Exploration in Reinforcement Learning
Exploration is one of the central topics in reinforcement learning (RL). Many existing
approaches take a single-agent perspective when tackling this problem. In this work, we …