A review of safe reinforcement learning: Methods, theory and applications

S Gu, L Yang, Y Du, G Chen, F Walter, J Wang… - arXiv preprint arXiv …, 2022 - arxiv.org
Reinforcement Learning (RL) has achieved tremendous success in many complex decision-
making tasks. However, safety concerns arise when deploying RL in real-world …

Combustion machine learning: Principles, progress and prospects

M Ihme, WT Chung, AA Mishra - Progress in Energy and Combustion …, 2022 - Elsevier
Progress in combustion science and engineering has led to the generation of large amounts
of data from large-scale simulations, high-resolution experiments, and sensors. This corpus …

RLAIF: Scaling reinforcement learning from human feedback with AI feedback

H Lee, S Phatale, H Mansoor, KR Lu, T Mesnard… - 2023 - openreview.net
Reinforcement learning from human feedback (RLHF) is an effective technique for aligning
large language models (LLMs) to human preferences, but gathering high-quality human …

Connecting the dots in trustworthy Artificial Intelligence: From AI principles, ethics, and key requirements to responsible AI systems and regulation

N Díaz-Rodríguez, J Del Ser, M Coeckelbergh… - Information …, 2023 - Elsevier
Trustworthy Artificial Intelligence (AI) is based on seven technical requirements
sustained over three main pillars that should be met throughout the system's entire life cycle …

Open problems and fundamental limitations of reinforcement learning from human feedback

S Casper, X Davies, C Shi, TK Gilbert… - arXiv preprint arXiv …, 2023 - arxiv.org
Reinforcement learning from human feedback (RLHF) is a technique for training AI systems
to align with human goals. RLHF has emerged as the central method used to finetune state …

Beyond the imitation game: Quantifying and extrapolating the capabilities of language models

A Srivastava, A Rastogi, A Rao, AAM Shoeb… - arXiv preprint arXiv …, 2022 - arxiv.org
Language models demonstrate both quantitative improvement and new qualitative
capabilities with increasing scale. Despite their potentially transformative impact, these new …

Guiding pretraining in reinforcement learning with large language models

Y Du, O Watkins, Z Wang, C Colas… - International …, 2023 - proceedings.mlr.press
Reinforcement learning algorithms typically struggle in the absence of a dense, well-shaped
reward function. Intrinsically motivated exploration methods address this limitation by …

A generalist agent

S Reed, K Zolna, E Parisotto, SG Colmenarejo… - arXiv preprint arXiv …, 2022 - arxiv.org
Inspired by progress in large-scale language modeling, we apply a similar approach
towards building a single generalist agent beyond the realm of text outputs. The agent …

Foundational challenges in assuring alignment and safety of large language models

U Anwar, A Saparov, J Rando, D Paleka… - arXiv preprint arXiv …, 2024 - arxiv.org
This work identifies 18 foundational challenges in assuring the alignment and safety of large
language models (LLMs). These challenges are organized into three different categories …

Red teaming language models with language models

E Perez, S Huang, F Song, T Cai, R Ring… - arXiv preprint arXiv …, 2022 - arxiv.org
Language Models (LMs) often cannot be deployed because of their potential to harm users
in hard-to-predict ways. Prior work identifies harmful behaviors before deployment by using …