AI alignment: A comprehensive survey

J Ji, T Qiu, B Chen, B Zhang, H Lou, K Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
AI alignment aims to make AI systems behave in line with human intentions and values. As
AI systems grow more capable, the potential large-scale risks associated with misaligned AI …

Human-in-the-loop reinforcement learning: A survey and position on requirements, challenges, and opportunities

CO Retzlaff, S Das, C Wayllace, P Mousavi… - Journal of Artificial …, 2024 - jair.org
Artificial intelligence (AI) and especially reinforcement learning (RL) have the potential to
enable agents to learn and perform tasks autonomously with superhuman performance …

Teaching language models to support answers with verified quotes

J Menick, M Trebacz, V Mikulik, J Aslanides… - arXiv preprint arXiv …, 2022 - arxiv.org
Recent large language models often answer factual questions correctly. But users can't trust
any given claim a model makes without fact-checking, because language models can …

Nash learning from human feedback

R Munos, M Valko, D Calandriello, MG Azar… - arXiv preprint arXiv …, 2023 - ai-plans.com
Large language models (LLMs) (Anil et al., 2023; Glaese et al., 2022; OpenAI, 2023; Ouyang
et al., 2022) have made remarkable strides in enhancing natural language understanding …

PEBBLE: Feedback-efficient interactive reinforcement learning via relabeling experience and unsupervised pre-training

K Lee, L Smith, P Abbeel - arXiv preprint arXiv:2106.05091, 2021 - arxiv.org
Conveying complex objectives to reinforcement learning (RL) agents can often be difficult,
involving meticulous design of reward functions that are sufficiently informative yet easy …
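
A minimal sketch of the relabeling idea named in the PEBBLE title: each time the learned reward model is refit from fresh human preferences, the rewards stored in the off-policy replay buffer are recomputed so the agent always trains on labels consistent with the current model. The buffer layout and the reward_model.predict interface below are illustrative assumptions, not the paper's implementation.

    # Relabel stored transitions with the latest learned reward (sketch).
    def relabel_replay_buffer(buffer, reward_model):
        # buffer: iterable of dicts holding 'obs', 'action', 'reward'
        for transition in buffer:
            transition["reward"] = float(
                reward_model.predict(transition["obs"], transition["action"])
            )
        return buffer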

Deep reinforcement learning from human preferences

PF Christiano, J Leike, T Brown… - Advances in Neural …, 2017 - proceedings.neurips.cc
For sophisticated reinforcement learning (RL) systems to interact usefully with real-world
environments, we need to communicate complex goals to these systems. In this work, we …
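
The approach this abstract describes fits a reward model to pairwise human preferences over trajectory segments and then trains a standard RL agent against the learned reward. Below is a hedged sketch of that reward-learning step using a Bradley-Terry likelihood; the network shape, tensor layout, and hyperparameters are assumptions for illustration, not the paper's exact setup.

    import torch
    import torch.nn as nn

    class RewardModel(nn.Module):
        # Small MLP scoring each observation; segment reward is the sum over time.
        def __init__(self, obs_dim, hidden=64):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, 1))

        def forward(self, segment):            # segment: (batch, time, obs_dim)
            return self.net(segment).squeeze(-1).sum(dim=-1)

    def preference_loss(model, seg_a, seg_b, prefs):
        # Bradley-Terry cross-entropy; prefs[i] = 1.0 when segment A was preferred.
        logits = model(seg_a) - model(seg_b)
        return nn.functional.binary_cross_entropy_with_logits(logits, prefs)

In use, segment pairs are shown to an annotator, the resulting labels are collected into prefs, the loss is minimized with any optimizer, and the RL agent then treats the model's output as its reward signal.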

Directly fine-tuning diffusion models on differentiable rewards

K Clark, P Vicol, K Swersky, DJ Fleet - arXiv preprint arXiv:2309.17400, 2023 - arxiv.org
We present Direct Reward Fine-Tuning (DRaFT), a simple and effective method for fine-
tuning diffusion models to maximize differentiable reward functions, such as scores from …
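
A hedged sketch of the general idea the abstract names: fine-tune by generating a sample through differentiable denoising steps and backpropagating a differentiable reward into the model parameters. The denoiser call, the simplified update rule, and reward_fn below are placeholders rather than DRaFT's actual sampler or components.

    import torch

    def reward_finetune_step(denoiser, reward_fn, optimizer, noise, num_steps=10):
        x = noise
        for t in reversed(range(num_steps)):
            # Keep the autograd graph through every denoising step so the
            # reward gradient reaches the denoiser's parameters.
            eps = denoiser(x, t)
            x = x - eps / num_steps       # simplified update; real samplers differ
        loss = -reward_fn(x).mean()       # maximize reward = minimize its negation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()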

Inverse preference learning: Preference-based RL without a reward function

J Hejna, D Sadigh - Advances in Neural Information …, 2024 - proceedings.neurips.cc
Reward functions are difficult to design and often hard to align with human intent. Preference-
based Reinforcement Learning (RL) algorithms address these problems by learning reward …

Contrastive preference learning: Learning from human feedback without RL

J Hejna, R Rafailov, H Sikchi, C Finn, S Niekum… - arXiv preprint arXiv …, 2023 - arxiv.org
Reinforcement Learning from Human Feedback (RLHF) has emerged as a popular
paradigm for aligning models with human intent. Typically, RLHF algorithms operate in two …

A survey of preference-based reinforcement learning methods

C Wirth, R Akrour, G Neumann, J Fürnkranz - Journal of Machine Learning …, 2017 - jmlr.org
Reinforcement learning (RL) techniques optimize the accumulated long-term reward of a
suitably chosen reward function. However, designing such a reward function often requires …