AI alignment: A comprehensive survey
AI alignment aims to make AI systems behave in line with human intentions and values. As
AI systems grow more capable, so do risks from misalignment. To provide a comprehensive …
The TRIPOD-LLM reporting guideline for studies using large language models
Large language models (LLMs) are rapidly being adopted in healthcare, necessitating
standardized reporting guidelines. We present Transparent Reporting of a multivariable …
DeepSeekMath: Pushing the limits of mathematical reasoning in open language models
Mathematical reasoning poses a significant challenge for language models due to its
complex and structured nature. In this paper, we introduce DeepSeekMath 7B, which …
Foundational challenges in assuring alignment and safety of large language models
This work identifies 18 foundational challenges in assuring the alignment and safety of large
language models (LLMs). These challenges are organized into three different categories …
A survey on knowledge distillation of large language models
In the era of Large Language Models (LLMs), Knowledge Distillation (KD) emerges as a
pivotal methodology for transferring advanced capabilities from leading proprietary LLMs …
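One widely used form of LLM knowledge distillation surveyed in this line of work is token-level distillation, in which the student matches the teacher's next-token distribution. A minimal sketch of that objective (standard formulation, assuming white-box access to the teacher; not necessarily the survey's notation):

\mathcal{L}_{\mathrm{KD}}(\theta) = \mathbb{E}_{(x,\,y)}\Big[\sum_{t} \mathrm{KL}\big(p_{\mathrm{teacher}}(\cdot \mid x, y_{<t}) \,\|\, p_{\theta}(\cdot \mid x, y_{<t})\big)\Big]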
Iterative preference learning from human feedback: Bridging theory and practice for RLHF under KL-constraint
This paper studies the alignment process of generative models with Reinforcement Learning
from Human Feedback (RLHF). We first identify the primary challenges of existing popular …
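The KL-constraint in the title refers to the standard KL-regularized RLHF objective, sketched here in generic notation (r is a reward model, \pi_{\mathrm{ref}} a fixed reference policy, \beta the regularization strength; the symbols are illustrative rather than the paper's own):

\max_{\pi}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\big[r(x, y)\big] \;-\; \beta\,\mathrm{KL}\big(\pi(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big)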
Direct language model alignment from online AI feedback
Direct alignment from preferences (DAP) methods, such as DPO, have recently emerged as
efficient alternatives to reinforcement learning from human feedback (RLHF) that do not …
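For context, DPO, the best-known DAP method, optimizes the following pairwise objective (standard formulation; y_w and y_l denote the preferred and dispreferred responses, \beta the implicit KL weight):

\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\Big[\log \sigma\Big(\beta \log \tfrac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \tfrac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\Big)\Big]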
RLHF workflow: From reward modeling to online RLHF
In this technical report, we present the workflow of Online Iterative Reinforcement Learning
from Human Feedback (RLHF), which is widely reported to outperform its offline counterpart …
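The reward-modeling stage of such a workflow is commonly trained with a Bradley-Terry pairwise loss over preference pairs; a minimal PyTorch-style sketch under that assumption (function and variable names are illustrative, not taken from the report):

import torch
import torch.nn.functional as F

def pairwise_reward_loss(chosen_scores: torch.Tensor, rejected_scores: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry objective: push the reward of the preferred response
    # above the reward of the rejected response for each preference pair.
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Illustrative usage with scalar rewards for a batch of three preference pairs.
chosen = torch.tensor([1.2, 0.4, 2.0])
rejected = torch.tensor([0.3, 0.9, 1.1])
loss = pairwise_reward_loss(chosen, rejected)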
Nemotron-4 340B technical report
B Adler, N Agarwal, A Aithal, DH Anh… - arXiv preprint arXiv …, 2024 - arxiv.org
We release the Nemotron-4 340B model family, including Nemotron-4-340B-Base,
Nemotron-4-340B-Instruct, and Nemotron-4-340B-Reward. Our models are open access …
Debating with more persuasive LLMs leads to more truthful answers
Common methods for aligning large language models (LLMs) with desired behaviour
heavily rely on human-labelled data. However, as models grow increasingly sophisticated …