AI alignment: A comprehensive survey

J Ji, T Qiu, B Chen, B Zhang, H Lou, K Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
AI alignment aims to make AI systems behave in line with human intentions and values. As
AI systems grow more capable, so do risks from misalignment. To provide a comprehensive …

Automatically correcting large language models: Surveying the landscape of diverse self-correction strategies

L Pan, M Saxon, W Xu, D Nathani, X Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language models (LLMs) have demonstrated remarkable performance across a wide
array of NLP tasks. However, their efficacy is undermined by undesired and inconsistent …

BeaverTails: Towards improved safety alignment of LLM via a human-preference dataset

J Ji, M Liu, J Dai, X Pan, C Zhang… - Advances in …, 2023 - proceedings.neurips.cc
In this paper, we introduce the BeaverTails dataset, aimed at fostering research on safety
alignment in large language models (LLMs). This dataset uniquely separates annotations of …

Self-RAG: Learning to retrieve, generate, and critique through self-reflection

A Asai, Z Wu, Y Wang, A Sil… - The Twelfth International …, 2023 - openreview.net
Despite their remarkable capabilities, large language models (LLMs) often produce
responses containing factual inaccuracies due to their sole reliance on the parametric …

Open problems and fundamental limitations of reinforcement learning from human feedback

S Casper, X Davies, C Shi, TK Gilbert… - arXiv preprint arXiv …, 2023 - arxiv.org
Reinforcement learning from human feedback (RLHF) is a technique for training AI systems
to align with human goals. RLHF has emerged as the central method used to finetune state …

Chain-of-verification reduces hallucination in large language models

S Dhuliawala, M Komeili, J Xu, R Raileanu, X Li… - arXiv preprint arXiv …, 2023 - arxiv.org
Generation of plausible yet incorrect factual information, termed hallucination, is an
unsolved issue in large language models. We study the ability of language models to …

RLHF-V: Towards trustworthy MLLMs via behavior alignment from fine-grained correctional human feedback

T Yu, Y Yao, H Zhang, T He, Y Han… - Proceedings of the …, 2024 - openaccess.thecvf.com
Multimodal Large Language Models (MLLMs) have recently demonstrated
impressive capabilities in multimodal understanding, reasoning, and interaction. However …

Safe RLHF: Safe reinforcement learning from human feedback

J Dai, X Pan, R Sun, J Ji, X Xu, M Liu, Y Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
With the development of large language models (LLMs), striking a balance between the
performance and safety of AI systems has never been more critical. However, the inherent …

Detecting and preventing hallucinations in large vision language models

A Gunjal, J Yin, E Bas - Proceedings of the AAAI Conference on …, 2024 - ojs.aaai.org
Instruction-tuned Large Vision Language Models (LVLMs) have significantly advanced in
generalizing across a diverse set of multi-modal tasks, especially for Visual Question …

Preference ranking optimization for human alignment

F Song, B Yu, M Li, H Yu, F Huang, Y Li… - Proceedings of the AAAI …, 2024 - ojs.aaai.org
Large language models (LLMs) often contain misleading content, emphasizing the need to
align them with human values to ensure secure AI systems. Reinforcement learning from …