AI alignment: A comprehensive survey
AI alignment aims to make AI systems behave in line with human intentions and values. As
AI systems grow more capable, the potential large-scale risks associated with misaligned AI …
Human-in-the-loop reinforcement learning: A survey and position on requirements, challenges, and opportunities
Artificial intelligence (AI) and especially reinforcement learning (RL) have the potential to
enable agents to learn and perform tasks autonomously with superhuman performance …
Teaching language models to support answers with verified quotes
Recent large language models often answer factual questions correctly. But users can't trust
any given claim a model makes without fact-checking, because language models can …
Nash learning from human feedback
Large language models (LLMs) (Anil et al., 2023; Glaese et al., 2022; OpenAI, 2023; Ouyang
et al., 2022) have made remarkable strides in enhancing natural language understanding …
PEBBLE: Feedback-efficient interactive reinforcement learning via relabeling experience and unsupervised pre-training
Conveying complex objectives to reinforcement learning (RL) agents can often be difficult,
involving meticulous design of reward functions that are sufficiently informative yet easy …
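The snippet above is cut off, but the mechanism PEBBLE's title names is relabeling: each time the learned reward model is refit on new preference queries, the rewards already stored in the replay buffer are recomputed under the current model. A minimal sketch, assuming a hypothetical callable reward_model and a list-based buffer of transitions (not the paper's actual code):

import torch

def relabel_replay_buffer(replay_buffer, reward_model):
    """Recompute stored rewards under the latest learned reward model.

    replay_buffer: list of dicts with 'state', 'action', 'reward' tensors.
    reward_model:  any callable mapping (state, action) -> scalar reward.
    """
    with torch.no_grad():  # relabeling is bookkeeping, not training
        for transition in replay_buffer:
            transition["reward"] = reward_model(
                transition["state"], transition["action"]
            )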
Deep reinforcement learning from human preferences
For sophisticated reinforcement learning (RL) systems to interact usefully with real-world
environments, we need to communicate complex goals to these systems. In this work, we …
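The truncated sentence introduces this paper's approach: fit a reward model to pairwise human preferences over trajectory segments via a Bradley-Terry model, then train the agent on the learned reward. A minimal sketch of the preference loss; the per-step reward predictions are assumed to come from a learned network:

import torch
import torch.nn.functional as F

def preference_loss(seg1_rewards, seg2_rewards, human_prefers_seg1):
    """Cross-entropy between predicted preference probability and the label.

    seg1_rewards, seg2_rewards: per-step predicted rewards, shape (batch, T).
    human_prefers_seg1: float labels in {0., 1.}, shape (batch,).
    """
    # Bradley-Terry: P(s1 > s2) = sigmoid(sum r(s1) - sum r(s2))
    logits = seg1_rewards.sum(dim=1) - seg2_rewards.sum(dim=1)
    return F.binary_cross_entropy_with_logits(logits, human_prefers_seg1)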
Directly fine-tuning diffusion models on differentiable rewards
We present Direct Reward Fine-Tuning (DRaFT), a simple and effective method for fine-
tuning diffusion models to maximize differentiable reward functions, such as scores from …
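As the snippet describes, DRaFT treats the reward as a differentiable training objective and backpropagates its gradient through the sampling chain into the diffusion model's weights. A heavily simplified sketch of one update; sample_with_grad and reward_fn are hypothetical stand-ins supplied by the caller, not the paper's API:

def draft_step(diffusion_model, sample_with_grad, reward_fn, optimizer, prompt):
    """One direct-reward fine-tuning step (simplified sketch)."""
    optimizer.zero_grad()
    # Generate a sample while keeping the computation graph, so the reward
    # gradient can flow back through the (possibly truncated) denoising chain.
    image = sample_with_grad(diffusion_model, prompt)
    loss = -reward_fn(image, prompt)  # maximizing reward = minimizing -reward
    loss.backward()
    optimizer.step()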
Inverse preference learning: Preference-based RL without a reward function
Reward functions are difficult to design and often hard to align with human intent. Preference-
based Reinforcement Learning (RL) algorithms address these problems by learning reward …
Contrastive preference learning: Learning from human feedback without RL
Reinforcement Learning from Human Feedback (RLHF) has emerged as a popular
paradigm for aligning models with human intent. Typically RLHF algorithms operate in two …
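The truncation hides the point of the title: CPL drops the separate reward-modeling and RL phases, scoring segments directly by the policy's temperature-scaled log-probabilities inside a Bradley-Terry-style contrastive objective. A simplified sketch (discounting and the paper's conservative weighting omitted; shapes and names are assumptions):

import torch.nn.functional as F

def cpl_loss(logp_seg1, logp_seg2, human_prefers_seg1, alpha=0.1):
    """Contrastive preference loss over policy log-probabilities.

    logp_seg1, logp_seg2: per-step log pi(a|s) along each segment, (batch, T).
    human_prefers_seg1: float labels in {0., 1.}, shape (batch,).
    """
    # Log-prob sums play the role that reward sums play in standard RLHF,
    # so no explicit reward model or RL loop is needed.
    score1 = alpha * logp_seg1.sum(dim=1)
    score2 = alpha * logp_seg2.sum(dim=1)
    return F.binary_cross_entropy_with_logits(score1 - score2,
                                              human_prefers_seg1)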
A survey of preference-based reinforcement learning methods
Reinforcement learning (RL) techniques optimize the accumulated long-term reward of a
suitably chosen reward function. However, designing such a reward function often requires …
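For reference, the objective the snippet alludes to is the expected discounted return; the preference-based methods this survey covers replace the hand-designed reward r with one inferred from pairwise comparisons:

J(\pi) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right]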