A minimaximalist approach to reinforcement learning from human feedback
We present Self-Play Preference Optimization (SPO), an algorithm for reinforcement
learning from human feedback. Our approach is minimalist in that it does not require training …
Contrastive preference learning: learning from human feedback without rl
Reinforcement Learning from Human Feedback (RLHF) has emerged as a popular
paradigm for aligning models with human intent. Typically RLHF algorithms operate in two …
Scaling laws for reward model overoptimization in direct alignment algorithms
Reinforcement Learning from Human Feedback (RLHF) has been crucial to the recent
success of Large Language Models (LLMs); however, it is often a complex and brittle …
Dual rl: Unification and new methods for reinforcement and imitation learning
The goal of reinforcement learning (RL) is to find a policy that maximizes the expected
cumulative return. It has been shown that this objective can be represented as an …
Robot air hockey: A manipulation testbed for robot learning with reinforcement learning
Reinforcement Learning is a promising tool for learning complex policies even in fast-
moving and object-interactive domains where human teleoperation or hard-coded policies …
A dual representation framework for robot learning with human guidance
The ability to interactively learn skills from human guidance and adjust behavior according
to human preference is crucial to accelerating robot learning. But human guidance is an …
Trajectory improvement and reward learning from comparative language feedback
Learning from human feedback has gained traction in fields like robotics and natural
language processing in recent years. While prior works mostly rely on human feedback in …
Sample-Efficient Preference-based Reinforcement Learning with Dynamics Aware Rewards
Preference-based reinforcement learning (PbRL) aligns robot behavior with human
preferences via a reward function learned from binary feedback over agent behaviors. We …
SMORE: Score Models for Offline Goal-Conditioned Reinforcement Learning
Offline Goal-Conditioned Reinforcement Learning (GCRL) is tasked with learning to achieve
multiple goals in an environment purely from offline datasets using sparse reward functions …
Imitation from arbitrary experience: A dual unification of reinforcement and imitation learning methods
It is well known that Reinforcement Learning (RL) can be formulated as a convex program
with linear constraints. The dual form of this formulation is unconstrained, which we refer to …