AI alignment: A comprehensive survey

J Ji, T Qiu, B Chen, B Zhang, H Lou, K Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
AI alignment aims to make AI systems behave in line with human intentions and values. As
AI systems grow more capable, so do risks from misalignment. To provide a comprehensive …

Open problems and fundamental limitations of reinforcement learning from human feedback

S Casper, X Davies, C Shi, TK Gilbert… - arXiv preprint arXiv …, 2023 - arxiv.org
Reinforcement learning from human feedback (RLHF) is a technique for training AI systems
to align with human goals. RLHF has emerged as the central method used to finetune state …
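
For orientation, a minimal sketch of the pairwise (Bradley-Terry) reward-model loss that sits at the core of most RLHF pipelines discussed in this line of work; the `reward_model` interface and tensor shapes are illustrative assumptions, not code from the paper:

```python
import torch.nn.functional as F

def reward_model_loss(reward_model, chosen_ids, rejected_ids):
    # Bradley-Terry pairwise objective: the human-preferred ("chosen")
    # response should score higher than the rejected one.
    # `reward_model` is assumed to map token ids to one scalar per sequence.
    r_chosen = reward_model(chosen_ids)      # shape: (batch,)
    r_rejected = reward_model(rejected_ids)  # shape: (batch,)
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```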

Reinforcement learning for generative AI: State of the art, opportunities and open research challenges

G Franceschelli, M Musolesi - Journal of Artificial Intelligence Research, 2024 - jair.org
Generative Artificial Intelligence (AI) is one of the most exciting developments in
Computer Science of the last decade. At the same time, Reinforcement Learning (RL) has …

MiniLLM: Knowledge distillation of large language models

Y Gu, L Dong, F Wei, M Huang - arXiv preprint arXiv:2306.08543, 2023 - arxiv.org
Knowledge Distillation (KD) is a promising technique for reducing the high computational
demand of large language models (LLMs). However, previous KD methods are primarily …
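
MiniLLM's key departure from standard KD is the direction of the KL divergence; a token-level sketch of the two objectives is shown below, assuming full teacher and student logits are available (the paper itself optimizes a sequence-level reverse KL with policy-gradient methods):

```python
import torch.nn.functional as F

def forward_kl(teacher_logits, student_logits):
    # Standard KD: KL(teacher || student); the student must spread mass
    # over every mode of the teacher distribution.
    p = F.log_softmax(teacher_logits, dim=-1)
    q = F.log_softmax(student_logits, dim=-1)
    return F.kl_div(q, p, log_target=True, reduction="batchmean")

def reverse_kl(teacher_logits, student_logits):
    # MiniLLM-style objective: KL(student || teacher); the student may
    # concentrate on the teacher's high-probability modes instead.
    p = F.log_softmax(teacher_logits, dim=-1)
    q = F.log_softmax(student_logits, dim=-1)
    return F.kl_div(p, q, log_target=True, reduction="batchmean")
```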

Diffusion model alignment using direct preference optimization

B Wallace, M Dang, R Rafailov… - Proceedings of the …, 2024 - openaccess.thecvf.com
Large language models (LLMs) are fine-tuned using human comparison data with
Reinforcement Learning from Human Feedback (RLHF) methods to make them better …
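
The underlying DPO objective, which this paper adapts from language models to diffusion models, can be sketched as follows; inputs are log-likelihoods of the preferred and rejected samples under the trained model and a frozen reference, and in the diffusion setting these likelihoods are replaced by a tractable bound:

```python
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # Implicit rewards are the log-likelihood ratios against the reference model.
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    # Logistic loss on the reward margin between chosen and rejected samples.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```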

Reinforced self-training (ReST) for language modeling

C Gulcehre, TL Paine, S Srinivasan… - arXiv preprint arXiv …, 2023 - arxiv.org
Reinforcement learning from human feedback (RLHF) can improve the quality of large
language model (LLM) outputs by aligning them with human preferences. We propose a …
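
The ReST recipe alternates a Grow step (sampling from the current policy) with an Improve step (reward-filtered fine-tuning); a schematic loop is sketched below, where `policy.sample`, `reward_fn`, and `finetune` are assumed interfaces rather than the paper's code:

```python
def rest_sketch(policy, prompts, reward_fn, finetune, thresholds=(0.5, 0.7, 0.9)):
    for threshold in thresholds:
        # Grow: sample candidate outputs from the current policy.
        dataset = [(p, policy.sample(p)) for p in prompts]
        # Improve: keep samples whose reward clears the threshold and
        # fine-tune on them; thresholds typically increase over iterations.
        filtered = [(p, y) for p, y in dataset if reward_fn(p, y) >= threshold]
        policy = finetune(policy, filtered)
    return policy
```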

Foundational challenges in assuring alignment and safety of large language models

U Anwar, A Saparov, J Rando, D Paleka… - arXiv preprint arXiv …, 2024 - arxiv.org
This work identifies 18 foundational challenges in assuring the alignment and safety of large
language models (LLMs). These challenges are organized into three different categories …

Rewarded soups: towards Pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards

A Rame, G Couairon, C Dancette… - Advances in …, 2023 - proceedings.neurips.cc
Foundation models are first pre-trained on vast unsupervised datasets and then fine-tuned
on labeled data. Reinforcement learning, notably from human feedback (RLHF), can further …
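
The "rewarded soup" itself is a linear interpolation of parameters from models fine-tuned on different rewards; a minimal sketch, assuming plain PyTorch modules with identical architectures and interpolation coefficients that sum to 1:

```python
import copy
import torch

def rewarded_soup(models, coeffs):
    # Average the parameters of several fine-tuned copies of the same network,
    # one per reward, weighted by the interpolation coefficients.
    soup = copy.deepcopy(models[0])
    param_dicts = [dict(m.named_parameters()) for m in models]
    with torch.no_grad():
        for name, param in soup.named_parameters():
            param.copy_(sum(c * d[name] for c, d in zip(coeffs, param_dicts)))
    return soup
```

Varying the coefficients traces out a family of models that trades off the different rewards without retraining.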

The alignment problem from a deep learning perspective

R Ngo, L Chan, S Mindermann - arXiv preprint arXiv:2209.00626, 2022 - arxiv.org
In coming years or decades, artificial general intelligence (AGI) may surpass human
capabilities at many critical tasks. We argue that, without substantial effort to prevent it, AGIs …

A long way to go: Investigating length correlations in RLHF

P Singhal, T Goyal, J Xu, G Durrett - arXiv preprint arXiv:2310.03716, 2023 - arxiv.org
Great success has been reported using Reinforcement Learning from Human Feedback
(RLHF) to align large language models, with open preference datasets enabling wider …
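
A small sketch of the kind of diagnostic this line of work reports: correlating reward-model scores with response length, where a strong positive correlation suggests the reward model (and hence RLHF) favors longer outputs. Lengths here are approximated by whitespace tokenization for illustration:

```python
import numpy as np

def length_reward_correlation(responses, rewards):
    # Pearson correlation between response length and reward score.
    lengths = np.array([len(r.split()) for r in responses], dtype=float)
    scores = np.array(rewards, dtype=float)
    return float(np.corrcoef(lengths, scores)[0, 1])
```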