Efficient diffusion policies for offline reinforcement learning

B Kang, X Ma, C Du, T Pang… - Advances in Neural …, 2024 - proceedings.neurips.cc
Offline reinforcement learning (RL) aims to learn optimal policies from offline datasets,
where the parameterization of policies is crucial but often overlooked. Recently, Diffusion-QL …

Understanding, predicting and better resolving Q-value divergence in offline-RL

Y Yue, R Lu, B Kang, S Song… - Advances in Neural …, 2024 - proceedings.neurips.cc
The divergence of the Q-value estimation has been a prominent issue in offline reinforcement
learning (offline RL), where the agent has no access to real dynamics. Traditional beliefs …

Preferred-Action-Optimized Diffusion Policies for Offline Reinforcement Learning

T Zhang, J Guan, L Zhao, Y Li, D Li, Z Zeng… - arXiv preprint arXiv …, 2024 - arxiv.org
Offline reinforcement learning (RL) aims to learn optimal policies from previously collected
datasets. Recently, due to their powerful representational capabilities, diffusion models have …

Exclusively Penalized Q-learning for Offline Reinforcement Learning

J Yeom, Y Jo, J Kim, S Lee, S Han - arXiv preprint arXiv:2405.14082, 2024 - arxiv.org
Constraint-based offline reinforcement learning (RL) involves policy constraints or imposing
penalties on the value function to mitigate overestimation errors caused by distributional …

Out-of-Distribution Adaptation in Offline RL: Counterfactual Reasoning via Causal Normalizing Flows

M Cho, JP How, C Sun - arXiv preprint arXiv:2405.03892, 2024 - arxiv.org
Despite notable successes of Reinforcement Learning (RL), the prevalent use of an online
learning paradigm prevents its widespread adoption, especially in hazardous or costly …

UDQL: Bridging The Gap between MSE Loss and The Optimal Value Function in Offline Reinforcement Learning

Y Zhang, R Yu, Z Yao, W Zhang, J Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
The Mean Squared Error (MSE) is commonly utilized to estimate the solution of the optimal
value function in the vast majority of offline reinforcement learning (RL) models and has …

Many of Your DPOs are Secretly One: Attempting Unification Through Mutual Information

R Tutnov, A Grosnit, H Bou-Ammar - arXiv preprint arXiv:2501.01544, 2025 - arxiv.org
Post-alignment of large language models (LLMs) is critical in improving their utility, safety,
and alignment with human intentions. Direct preference optimisation (DPO) has become one …

A Collaborative Perspective on Exploration in Reinforcement Learning

Y Fu, H Zhang, D Wu, W Xu, B Boulet - openreview.net
Exploration is one of the central topics in reinforcement learning (RL). Many existing
approaches take a single-agent perspective when tackling this problem. In this work, we …