AI alignment: A comprehensive survey

J Ji, T Qiu, B Chen, B Zhang, H Lou, K Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
AI alignment aims to make AI systems behave in line with human intentions and values. As
AI systems grow more capable, the potential large-scale risks associated with misaligned AI …

Simulating 500 million years of evolution with a language model

T Hayes, R Rao, H Akin, NJ Sofroniew, D Oktay, Z Lin… - Science, 2025 - science.org
More than three billion years of evolution have produced an image of biology encoded into
the space of natural proteins. Here we show that language models trained at scale on …

Nash learning from human feedback

R Munos, M Valko, D Calandriello, MG Azar… - arXiv preprint arXiv …, 2023 - ai-plans.com
Large language models (LLMs)(Anil et al., 2023; Glaese et al., 2022; OpenAI, 2023; Ouyang
et al., 2022) have made remarkable strides in enhancing natural language understanding …

Iterative preference learning from human feedback: Bridging theory and practice for RLHF under KL-constraint

W Xiong, H Dong, C Ye, Z Wang, H Zhong… - … on Machine Learning, 2024 - openreview.net
This paper studies the theoretical framework of the alignment process of generative models
with Reinforcement Learning from Human Feedback (RLHF). We consider a standard …
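
For context, the KL-constrained objective analyzed in this line of work is the standard KL-regularized RLHF objective, written below in common notation (π_ref is the reference policy, r the learned reward, β the KL coefficient; the symbols are not taken from the snippet itself):

\[
\max_{\pi}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot\mid x)}\big[r(x,y)\big] \;-\; \beta\, \mathbb{E}_{x \sim \mathcal{D}}\big[\mathrm{KL}\big(\pi(\cdot\mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big)\big]
\]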

A minimaximalist approach to reinforcement learning from human feedback

G Swamy, C Dann, R Kidambi, ZS Wu… - arXiv preprint arXiv …, 2024 - arxiv.org
We present Self-Play Preference Optimization (SPO), an algorithm for reinforcement
learning from human feedback. Our approach is minimalist in that it does not require training …

Direct Nash optimization: Teaching language models to self-improve with general preferences

C Rosset, CA Cheng, A Mitra, M Santacroce… - arXiv preprint arXiv …, 2024 - arxiv.org
This paper studies post-training large language models (LLMs) using preference feedback
from a powerful oracle to help a model iteratively improve over itself. The typical approach …

A survey of reinforcement learning from human feedback

T Kaufmann, P Weng, V Bengs… - arXiv preprint arXiv …, 2023 - arxiv.org
Reinforcement learning from human feedback (RLHF) is a variant of reinforcement learning
(RL) that learns from human feedback instead of relying on an engineered reward function …

DPO meets PPO: Reinforced token optimization for RLHF

H Zhong, G Feng, W Xiong, X Cheng, L Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
In the classical Reinforcement Learning from Human Feedback (RLHF) framework, Proximal
Policy Optimization (PPO) is employed to learn from sparse, sentence-level rewards--a …
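
The sparse, sentence-level reward mentioned here is typically fit to pairwise human comparisons under the Bradley-Terry model; a minimal sketch in standard notation (y_w and y_l are the preferred and dispreferred responses, σ the logistic function; not quoted from the paper):

\[
P(y_w \succ y_l \mid x) \;=\; \sigma\big(r(x, y_w) - r(x, y_l)\big)
\]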

Direct language model alignment from online AI feedback

S Guo, B Zhang, T Liu, T Liu, M Khalman… - arXiv preprint arXiv …, 2024 - arxiv.org
Direct alignment from preferences (DAP) methods, such as DPO, have recently emerged as
efficient alternatives to reinforcement learning from human feedback (RLHF), that do not …

SimPO: Simple preference optimization with a reference-free reward

Y Meng, M Xia, D Chen - arXiv preprint arXiv:2405.14734, 2024 - arxiv.org
Direct Preference Optimization (DPO) is a widely used offline preference optimization
algorithm that reparameterizes reward functions in reinforcement learning from human …
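
For comparison, the implicit reward that DPO optimizes and SimPO's reference-free, length-normalized reward are usually written as below (standard notation; β is a scaling factor, |y| the response length, and γ a target reward margin, none of which appear in the truncated snippet):

\[
r_{\mathrm{DPO}}(x,y) = \beta \log \frac{\pi_\theta(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)}, \qquad
r_{\mathrm{SimPO}}(x,y) = \frac{\beta}{|y|} \log \pi_\theta(y\mid x)
\]
\[
\mathcal{L}_{\mathrm{SimPO}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\Big[\log \sigma\big(r_{\mathrm{SimPO}}(x,y_w) - r_{\mathrm{SimPO}}(x,y_l) - \gamma\big)\Big]
\]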