AI alignment: A comprehensive survey

J Ji, T Qiu, B Chen, B Zhang, H Lou, K Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
AI alignment aims to make AI systems behave in line with human intentions and values. As
AI systems grow more capable, so do risks from misalignment. To provide a comprehensive …

A survey of imitation learning: Algorithms, recent developments, and challenges

M Zare, PM Kebria, A Khosravi… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
In recent years, the development of robotics and artificial intelligence (AI) systems has been
nothing short of remarkable. As these systems continue to evolve, they are being utilized in …

A survey of reinforcement learning from human feedback

T Kaufmann, P Weng, V Bengs… - arXiv preprint arXiv …, 2023 - researchgate.net
Reinforcement learning from human feedback (RLHF) is a variant of reinforcement learning
(RL) that learns from human feedback instead of relying on an engineered reward function …
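
As concrete context for the reward-modelling step these methods share, the sketch below fits a reward network to pairwise preference labels with the Bradley-Terry loss, P(a ≻ b) = sigmoid(R(a) - R(b)). It is a minimal toy example: the RewardNet architecture, feature dimensions, and random data are placeholders, not the setup of any surveyed paper.

    import torch
    import torch.nn as nn

    class RewardNet(nn.Module):
        """Maps a per-step feature vector to a scalar reward (placeholder architecture)."""
        def __init__(self, obs_dim):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))

        def forward(self, x):
            return self.net(x).squeeze(-1)

    def preference_loss(model, seg_a, seg_b, prefer_a):
        """Bradley-Terry loss: P(a > b) = sigmoid(R(a) - R(b)), where R sums
        the model's per-step rewards over each trajectory segment."""
        ret_a = model(seg_a).sum(dim=-1)   # (batch,) summed reward of segment a
        ret_b = model(seg_b).sum(dim=-1)
        return nn.functional.binary_cross_entropy_with_logits(ret_a - ret_b, prefer_a)

    # Toy usage on random tensors, just to show the shapes involved.
    model = RewardNet(obs_dim=8)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    seg_a, seg_b = torch.randn(32, 20, 8), torch.randn(32, 20, 8)  # (batch, horizon, obs_dim)
    prefer_a = torch.randint(0, 2, (32,)).float()                  # 1.0 where segment a was preferred
    for _ in range(100):
        opt.zero_grad()
        preference_loss(model, seg_a, seg_b, prefer_a).backward()
        opt.step()

In a full RLHF pipeline the learned reward would then be handed to a standard RL algorithm for policy optimisation; that stage is omitted here.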

Towards guaranteed safe AI: A framework for ensuring robust and reliable AI systems

D Dalrymple, J Skalse, Y Bengio… - arXiv preprint arXiv …, 2024 - eecs.berkeley.edu
Ensuring that AI systems reliably and robustly avoid harmful or dangerous behaviours is a
crucial challenge, especially for AI systems with a high degree of autonomy and general …

Maximum-likelihood inverse reinforcement learning with finite-time guarantees

S Zeng, C Li, A Garcia, M Hong - Advances in Neural …, 2022 - proceedings.neurips.cc
Inverse reinforcement learning (IRL) aims to recover the reward function and the associated
optimal policy that best fits observed sequences of states and actions implemented by an …
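
To make the maximum-likelihood objective concrete (without reproducing this paper's finite-time analysis), the tabular sketch below models the demonstrator as soft-optimal for the unknown reward and maximizes the likelihood of observed state-action pairs by differentiating through a fixed number of soft value-iteration steps. The MDP, the state-based reward parameterization, and the random "demonstrations" are all toy placeholders.

    import torch

    S, A, gamma = 5, 3, 0.9
    P = torch.softmax(torch.randn(S, A, S), dim=-1)   # toy transition kernel P(s' | s, a)
    theta = torch.zeros(S, requires_grad=True)        # reward parameters, one per state

    def demo_log_policy(reward, iters=50):
        """Soft value iteration: V(s) = logsumexp_a [r(s) + gamma * E_{s'} V(s')];
        returns log pi(a | s) of the soft-optimal (Boltzmann) policy."""
        v = torch.zeros(S)
        for _ in range(iters):
            q = reward.unsqueeze(-1) + gamma * (P @ v)   # (S, A) soft Q-values
            v = torch.logsumexp(q, dim=-1)
        return torch.log_softmax(q, dim=-1)

    # Placeholder demonstrations: 200 random (state, action) pairs.
    demo_s = torch.randint(0, S, (200,))
    demo_a = torch.randint(0, A, (200,))

    opt = torch.optim.Adam([theta], lr=0.1)
    for _ in range(200):
        opt.zero_grad()
        nll = -demo_log_policy(theta)[demo_s, demo_a].mean()  # negative log-likelihood of demos
        nll.backward()
        opt.step()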

Invariance in policy optimisation and partial identifiability in reward learning

JMV Skalse, M Farrugia-Roberts… - International …, 2023 - proceedings.mlr.press
It is often very challenging to manually design reward functions for complex, real-world
tasks. To solve this, one can instead use reward learning to infer a reward function from …
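
A canonical instance of this partial identifiability, stated here as standard background rather than as this paper's specific result, is potential-based shaping (Ng, Harada and Russell, 1999): for any potential function Φ over states, the transformed reward

    R'(s, a, s') = R(s, a, s') + γ Φ(s') - Φ(s)

shifts every policy's expected return by the same start-state constant, so optimal policies (and the ordering of policies by return) are unchanged; behavioural data alone can therefore identify the reward at best up to transformations of this kind.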

Misspecification in inverse reinforcement learning

J Skalse, A Abate - Proceedings of the AAAI Conference on Artificial …, 2023 - ojs.aaai.org
The aim of Inverse Reinforcement Learning (IRL) is to infer a reward function R from
a policy π. To do this, we need a model of how π relates to R. In the current literature, the …
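
A standard choice of such a model, given here as common background since the snippet is cut off before naming any, is Boltzmann rationality, under which the demonstrator is noisily optimal for R:

    π(a | s) = exp(β Q*_R(s, a)) / Σ_{a'} exp(β Q*_R(s, a'))

where Q*_R is the optimal Q-function for R and β > 0 sets how nearly optimal the demonstrator is assumed to be; misspecification then means the true demonstrator deviates from whichever such model the IRL method assumes.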

Beyond preferences in AI alignment

T Zhi-Xuan, M Carroll, M Franklin, H Ashton - Philosophical Studies, 2024 - Springer
The dominant practice of AI alignment assumes (1) that preferences are an adequate
representation of human values, (2) that human rationality can be understood in terms of …

Identifiability in inverse reinforcement learning

H Cao, S Cohen, L Szpruch - Advances in Neural …, 2021 - proceedings.neurips.cc
Inverse reinforcement learning attempts to reconstruct the reward function in a Markov
decision problem, using observations of agent actions. As already observed in Russell …

Models of human preference for learning reward functions

WB Knox, S Hatgis-Kessell, S Booth, S Niekum… - arXiv preprint arXiv …, 2022 - arxiv.org
The utility of reinforcement learning is limited by the alignment of reward functions with the
interests of human stakeholders. One promising method for alignment is to learn the reward …
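
The baseline model in most of this line of work scores a pair of trajectory segments by their summed rewards (the "partial return" model):

    P(σ₁ ≻ σ₂) = exp(R(σ₁)) / (exp(R(σ₁)) + exp(R(σ₂))),   with R(σ) = Σ_t r(s_t, a_t)

This paper compares that model against a regret-based alternative that judges each segment by how far it falls short of optimal behaviour under r; the precise regret formulation is in the paper itself and is not reproduced here.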