AI alignment: A comprehensive survey

J Ji, T Qiu, B Chen, B Zhang, H Lou, K Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
AI alignment aims to make AI systems behave in line with human intentions and values. As
AI systems grow more capable, so do risks from misalignment. To provide a comprehensive …

The TRIPOD-LLM reporting guideline for studies using large language models

J Gallifant, M Afshar, S Ameen, Y Aphinyanaphongs… - Nature Medicine, 2025 - nature.com
Large language models (LLMs) are rapidly being adopted in healthcare, necessitating
standardized reporting guidelines. We present transparent reporting of a multivariable …

DeepSeekMath: Pushing the limits of mathematical reasoning in open language models

Z Shao, P Wang, Q Zhu, R Xu, J Song, X Bi… - arXiv preprint arXiv …, 2024 - arxiv.org
Mathematical reasoning poses a significant challenge for language models due to its
complex and structured nature. In this paper, we introduce DeepSeekMath 7B, which …

Foundational challenges in assuring alignment and safety of large language models

U Anwar, A Saparov, J Rando, D Paleka… - arXiv preprint arXiv …, 2024 - arxiv.org
This work identifies 18 foundational challenges in assuring the alignment and safety of large
language models (LLMs). These challenges are organized into three different categories …

A survey on knowledge distillation of large language models

X Xu, M Li, C Tao, T Shen, R Cheng, J Li, C Xu… - arXiv preprint arXiv …, 2024 - arxiv.org
In the era of Large Language Models (LLMs), Knowledge Distillation (KD) emerges as a
pivotal methodology for transferring advanced capabilities from leading proprietary LLMs …

Iterative preference learning from human feedback: Bridging theory and practice for RLHF under KL-constraint

W Xiong, H Dong, C Ye, Z Wang, H Zhong, H Ji… - arXiv preprint arXiv …, 2023 - arxiv.org
This paper studies the alignment process of generative models with Reinforcement Learning
from Human Feedback (RLHF). We first identify the primary challenges of existing popular …
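
For context, the KL-constrained objective that this line of work analyzes is usually written as follows; this is the standard formulation stated here as background, not necessarily the paper's exact notation:

\max_{\pi}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\big[\, r(x, y) \,\big] \;-\; \eta\, \mathrm{KL}\big( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big),

where r is the learned reward model, \pi_{\mathrm{ref}} is the reference (supervised fine-tuned) policy, and \eta > 0 controls how far the trained policy may drift from the reference.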

Direct language model alignment from online AI feedback

S Guo, B Zhang, T Liu, T Liu, M Khalman… - arXiv preprint arXiv …, 2024 - arxiv.org
Direct alignment from preferences (DAP) methods, such as DPO, have recently emerged as
efficient alternatives to reinforcement learning from human feedback (RLHF), that do not …
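
As background on the DAP family named above, a minimal sketch of the standard offline DPO loss follows; the function and argument names are illustrative assumptions, and the paper's online AI-feedback variant is not reproduced here.

import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Inputs are per-example summed log-probabilities of the chosen and rejected
    # responses under the trained policy and a frozen reference model.
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    # DPO objective: -log sigmoid(beta * (chosen margin - rejected margin)),
    # averaged over the batch.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()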

RLHF workflow: From reward modeling to online RLHF

H Dong, W Xiong, B Pang, H Wang, H Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
In this technical report, we present the workflow of Online Iterative Reinforcement Learning
from Human Feedback (RLHF), which is widely reported to outperform its offline counterpart …

Nemotron-4 340b technical report

B Adler, N Agarwal, A Aithal, DH Anh… - arXiv preprint arXiv …, 2024 - arxiv.org
We release the Nemotron-4 340B model family, including Nemotron-4-340B-Base,
Nemotron-4-340B-Instruct, and Nemotron-4-340B-Reward. Our models are open access …

Debating with more persuasive LLMs leads to more truthful answers

A Khan, J Hughes, D Valentine, L Ruis… - arXiv preprint arXiv …, 2024 - arxiv.org
Common methods for aligning large language models (LLMs) with desired behaviour
heavily rely on human-labelled data. However, as models grow increasingly sophisticated …