AI alignment: A comprehensive survey
AI alignment aims to make AI systems behave in line with human intentions and values. As
AI systems grow more capable, the potential large-scale risks associated with misaligned AI …
Simulating 500 million years of evolution with a language model
More than three billion years of evolution have produced an image of biology encoded into
the space of natural proteins. Here we show that language models trained at scale on …
Nash learning from human feedback
Large language models (LLMs)(Anil et al., 2023; Glaese et al., 2022; OpenAI, 2023; Ouyang
et al., 2022) have made remarkable strides in enhancing natural language understanding …
Iterative preference learning from human feedback: Bridging theory and practice for RLHF under KL-constraint
This paper studies the theoretical framework of the alignment process of generative models
with Reinforcement Learning from Human Feedback (RLHF). We consider a standard …
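The KL-constrained setting named in the title is usually formalized as the following regularized objective; this is the standard formulation rather than anything specific to this paper, with r the learned reward model, pi_ref the reference (SFT) policy, and beta the KL coefficient:

```latex
\max_{\pi_\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[\, r(x, y) \,\big]
\;-\; \beta\,
\mathbb{E}_{x \sim \mathcal{D}}\big[\, \mathrm{KL}\big(\pi_\theta(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big) \,\big]
```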
A minimaximalist approach to reinforcement learning from human feedback
We present Self-Play Preference Optimization (SPO), an algorithm for reinforcement
learning from human feedback. Our approach is minimalist in that it does not require training …
Direct nash optimization: Teaching language models to self-improve with general preferences
This paper studies post-training large language models (LLMs) using preference feedback
from a powerful oracle to help a model iteratively improve over itself. The typical approach …
A survey of reinforcement learning from human feedback
Reinforcement learning from human feedback (RLHF) is a variant of reinforcement learning
(RL) that learns from human feedback instead of relying on an engineered reward function …
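As a concrete illustration of learning from human feedback rather than an engineered reward, here is a minimal sketch of the pairwise (Bradley-Terry) reward-model loss used in most RLHF pipelines; the function and variable names are illustrative, not taken from the survey.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(reward_chosen: torch.Tensor,
                      reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss: push the scalar reward of the
    human-preferred response above that of the rejected response.
    Both inputs have shape (batch,) and come from a reward head on an LLM."""
    # -log sigmoid(r_chosen - r_rejected), averaged over the batch
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()
```

The trained reward model then supplies the scalar signal that an RL algorithm such as PPO optimizes against.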
DPO meets PPO: Reinforced token optimization for RLHF
In the classical Reinforcement Learning from Human Feedback (RLHF) framework, Proximal
Policy Optimization (PPO) is employed to learn from sparse, sentence-level rewards--a …
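The contrast between sparse sentence-level rewards and denser token-level credit assignment can be sketched as follows; this is a generic illustration, and the per-token reward built from policy/reference log-ratios is an assumption for exposition, not necessarily the paper's exact construction.

```python
import torch

def sentence_level_rewards(seq_len: int, rm_score: float) -> torch.Tensor:
    """Classical RLHF-PPO signal: the reward-model score lands only on the
    final generated token; every earlier token receives zero reward."""
    rewards = torch.zeros(seq_len)
    rewards[-1] = rm_score
    return rewards

def token_level_rewards(logp_policy: torch.Tensor,
                        logp_ref: torch.Tensor,
                        beta: float = 0.1) -> torch.Tensor:
    """One way to densify the signal (illustrative assumption): per-token
    rewards proportional to the log-probability ratio between a
    preference-tuned policy and the reference model."""
    return beta * (logp_policy - logp_ref)
```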
Direct language model alignment from online ai feedback
Direct alignment from preferences (DAP) methods, such as DPO, have recently emerged as
efficient alternatives to reinforcement learning from human feedback (RLHF), that do not …
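A rough sketch of how such an online loop can look, sampling from the current policy and letting an AI annotator supply the preference label before each DAP update; the callables below are hypothetical placeholders, not the paper's API.

```python
from typing import Callable, Tuple

def online_dap_step(prompt: str,
                    sample: Callable[[str], str],
                    annotate: Callable[[str, str, str], Tuple[str, str]],
                    dap_update: Callable[[str, str, str], None]) -> None:
    """One online step: draw two responses from the *current* policy, ask an
    AI annotator which is preferred, then apply a DAP loss (e.g. DPO) to the
    freshly labelled pair instead of a fixed offline preference dataset."""
    response_a = sample(prompt)
    response_b = sample(prompt)
    chosen, rejected = annotate(prompt, response_a, response_b)
    dap_update(prompt, chosen, rejected)
```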
SimPO: Simple preference optimization with a reference-free reward
Direct Preference Optimization (DPO) is a widely used offline preference optimization
algorithm that reparameterizes reward functions in reinforcement learning from human …
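For context on the reparameterization mentioned in the snippet, a minimal sketch of the standard DPO loss, in which the implicit reward of a response is the scaled policy/reference log-ratio; SimPO's reference-free reward differs, and the names below are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen: torch.Tensor, logp_rejected: torch.Tensor,
             ref_logp_chosen: torch.Tensor, ref_logp_rejected: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective: implicit reward = beta * (log pi_theta - log pi_ref),
    and the loss is a logistic loss on the reward margin between the chosen
    and rejected responses (log-probs are summed over response tokens)."""
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```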