Reinforced self-training (ReST) for language modeling

C Gulcehre, TL Paine, S Srinivasan… - arXiv preprint arXiv …, 2023 - arxiv.org
Reinforcement learning from human feedback (RLHF) can improve the quality of large
language model (LLM) outputs by aligning them with human preferences. We propose a …

Large language models play StarCraft II: Benchmarks and a chain of summarization approach

W Ma, Q Mi, Y Zeng, X Yan, R Lin… - Advances in …, 2025 - proceedings.neurips.cc
With the continued advancement of Large Language Model (LLM) agents in reasoning,
planning, and decision-making, benchmarks have become crucial in evaluating these skills …

Efficient diffusion policies for offline reinforcement learning

B Kang, X Ma, C Du, T Pang… - Advances in Neural …, 2023 - proceedings.neurips.cc
Offline reinforcement learning (RL) aims to learn optimal policies from offline datasets,
where the parameterization of policies is crucial but often overlooked. Recently, Diffusion-QL …

Plan better amid conservatism: Offline multi-agent reinforcement learning with actor rectification

L Pan, L Huang, T Ma, H Xu - International conference on …, 2022 - proceedings.mlr.press
Conservatism has led to significant progress in offline reinforcement learning (RL) where an
agent learns from pre-collected datasets. However, as many real-world scenarios involve …

Large-scale retrieval for reinforcement learning

P Humphreys, A Guez, O Tieleman… - Advances in …, 2022 - proceedings.neurips.cc
Effective decision making involves flexibly relating past experiences and relevant contextual
information to a novel situation. In deep reinforcement learning (RL), the dominant paradigm …

Hokoff: Real game dataset from Honor of Kings and its offline reinforcement learning benchmarks

Y Qu, B Wang, J Shao, Y Jiang… - Advances in …, 2023 - proceedings.neurips.cc
The advancement of Offline Reinforcement Learning (RL) and Offline Multi-Agent
Reinforcement Learning (MARL) critically depends on the availability of high-quality, pre …

An empirical study of implicit regularization in deep offline RL

C Gulcehre, S Srinivasan, J Sygnowski… - arXiv preprint arXiv …, 2022 - arxiv.org
Deep neural networks are the most commonly used function approximators in offline
reinforcement learning. Prior works have shown that neural nets trained with TD-learning …

Learning to reach goals via diffusion

V Jain, S Ravanbakhsh - arXiv preprint arXiv:2310.02505, 2023 - arxiv.org
We present a novel perspective on goal-conditioned reinforcement learning by framing it
within the context of denoising diffusion models. Analogous to the diffusion process, where …

A new approach to solving SMAC task: Generating decision tree code from large language models

Y Deng, W Ma, Y Fan, Y Zhang, H Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
StarCraft Multi-Agent Challenge (SMAC) is one of the most commonly used experimental
environments in multi-agent reinforcement learning (MARL), where the specific task is to …

Guided Proximal Policy Optimization with Structured Action Graph for Complex Decision-making

Y Yang, D Xing, W Xia, P Wang - Machine Intelligence Research, 2025 - Springer
Reinforcement learning encounters formidable challenges when tasked with intricate
decision-making scenarios, primarily due to the expansive parameterized action spaces and …