Iterative reasoning preference optimization

RY Pang, W Yuan, H He, K Cho… - Advances in …, 2025 - proceedings.neurips.cc
Iterative preference optimization methods have recently been shown to perform well for
general instruction tuning tasks, but typically make little improvement on reasoning tasks. In …

Regularizing hidden states enables learning generalizable reward model for LLMs

R Yang, R Ding, Y Lin, H Zhang… - Advances in Neural …, 2025 - proceedings.neurips.cc
Reward models trained on human preference data have been proven to effectively align
Large Language Models (LLMs) with human intent within the framework of reinforcement …

RL on incorrect synthetic data scales the efficiency of LLM math reasoning by eight-fold

A Setlur, S Garg, X Geng, N Garg… - Advances in Neural …, 2025 - proceedings.neurips.cc
Training on model-generated synthetic data is a promising approach for finetuning LLMs,
but it remains unclear when it helps or hurts. In this paper, we investigate this question for …

Is DPO superior to PPO for LLM alignment? A comprehensive study

S Xu, W Fu, J Gao, W Ye, W Liu, Z Mei, G Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Reinforcement Learning from Human Feedback (RLHF) is currently the most widely used
method to align large language models (LLMs) with human preferences. Existing RLHF …
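
For readers comparing the two methods named in the title, the standard objectives (stated here as general background, not as details of this paper's experiments) are the KL-regularized reward maximization that PPO optimizes and the direct preference loss that DPO minimizes:

$$
\max_{\pi_\theta}\ \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot\mid x)}\big[r(x,y)\big] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\big[\pi_\theta(\cdot\mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big],
$$

$$
\mathcal{L}_{\mathrm{DPO}}(\theta) \;=\; -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} \;-\; \beta \log \frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right],
$$

where $y_w$ and $y_l$ are the preferred and dispreferred responses in a pair, $\pi_{\mathrm{ref}}$ is the reference policy, and $\sigma$ is the logistic function. PPO optimizes the first objective online against a learned reward model $r$, while DPO fits the second directly on preference data; the paper's question is which of these behaves better in practice.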

Self-play preference optimization for language model alignment

Y Wu, Z Sun, H Yuan, K Ji, Y Yang, Q Gu - arXiv preprint arXiv:2405.00675, 2024 - arxiv.org
Standard reinforcement learning from human feedback (RLHF) approaches relying on
parametric models like the Bradley-Terry model fall short in capturing the intransitivity and …
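
As background for the intransitivity point in the snippet, the Bradley-Terry model scores every pairwise comparison through a single scalar reward $r(y\mid x)$:

$$
P(y_1 \succ y_2 \mid x) \;=\; \frac{\exp\big(r(y_1\mid x)\big)}{\exp\big(r(y_1\mid x)\big) + \exp\big(r(y_2\mid x)\big)} \;=\; \sigma\big(r(y_1\mid x) - r(y_2\mid x)\big).
$$

Because every comparison is mediated by one scalar score per response, the induced preferences are necessarily transitive, which is the limitation the abstract alludes to; modeling general, possibly intransitive preferences requires working with the pairwise preference probability $P(y_1 \succ y_2 \mid x)$ directly.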

Magpie: Alignment data synthesis from scratch by prompting aligned LLMs with nothing

Z Xu, F Jiang, L Niu, Y Deng, R Poovendran… - arXiv preprint arXiv …, 2024 - arxiv.org
High-quality instruction data is critical for aligning large language models (LLMs). Although
some models, such as Llama-3-Instruct, have open weights, their alignment data remain …
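
The title hints at the mechanism: an instruct-tuned model given only its pre-query chat template will autocomplete a plausible user instruction, which the same model can then answer. A minimal sketch of that idea, assuming a hypothetical complete() text-completion function and Llama-3-style template strings (both are assumptions, not APIs or details taken from the paper):

```python
from typing import Callable, Dict

# Pre- and post-query chat template strings for a Llama-3-style instruct model;
# the exact special tokens are an assumption and depend on the model's tokenizer.
PRE_QUERY_TEMPLATE = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
POST_QUERY_TEMPLATE = "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"

def synthesize_pair(complete: Callable[[str], str]) -> Dict[str, str]:
    """Elicit one (instruction, response) pair from an aligned LLM with no seed data."""
    # Step 1: given only the pre-query template, the model autocompletes a user query.
    instruction = complete(PRE_QUERY_TEMPLATE).strip()
    # Step 2: wrap the generated query in the full chat template and let the model answer it.
    response = complete(PRE_QUERY_TEMPLATE + instruction + POST_QUERY_TEMPLATE).strip()
    return {"instruction": instruction, "response": response}
```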

BoNBoN alignment for large language models and the sweetness of best-of-n sampling

L Gui, C Gârbacea, V Veitch - Advances in Neural …, 2025 - proceedings.neurips.cc
This paper concerns the problem of aligning samples from large language models to human
preferences using *best-of-$n$* sampling, where we draw $n$ samples, rank them, and …
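
The best-of-$n$ procedure the snippet describes is easy to state in code. The sketch below assumes hypothetical generate and score callables standing in for a base LLM and a reward model; it illustrates the sampling scheme itself, not this paper's BoNBoN method:

```python
from typing import Callable, List

def best_of_n(
    prompt: str,
    generate: Callable[[str], str],      # draws one response from the base LLM
    score: Callable[[str, str], float],  # reward model: (prompt, response) -> scalar
    n: int = 8,
) -> str:
    """Draw n samples for the prompt and return the one the reward model ranks highest."""
    samples: List[str] = [generate(prompt) for _ in range(n)]
    # Rank by reward and keep the best; ties are broken arbitrarily by max().
    return max(samples, key=lambda response: score(prompt, response))
```

Because it requires no gradient updates to the base model, best-of-$n$ is often used as a strong inference-time baseline against which alignment methods are compared.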

Training language models to self-correct via reinforcement learning

A Kumar, V Zhuang, R Agarwal, Y Su… - arXiv preprint arXiv …, 2024 - arxiv.org
Self-correction is a highly desirable capability of large language models (LLMs), yet it has
consistently been found to be largely ineffective in modern LLMs. Current methods for …

Model alignment as prospect theoretic optimization

K Ethayarajh, W Xu, N Muennighoff… - … on Machine Learning, 2024 - openreview.net
Kahneman & Tversky's $\textit{prospect theory}$ tells us that humans perceive random
variables in a biased but well-defined manner (1992); for example, humans are famously …
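
The "biased but well-defined" perception referred to here is usually formalized by Tversky & Kahneman's (1992) value function, which is concave for gains, convex for losses, and steeper for losses (loss aversion):

$$
v(z) \;=\;
\begin{cases}
z^{\alpha} & \text{if } z \ge 0,\\[2pt]
-\lambda\,(-z)^{\beta} & \text{if } z < 0,
\end{cases}
\qquad 0 < \alpha, \beta \le 1,\ \ \lambda > 1,
$$

with fitted values of roughly $\alpha \approx \beta \approx 0.88$ and $\lambda \approx 2.25$ in the original study. The paper builds its alignment objective around this kind of asymmetric, reference-dependent value function rather than around a pairwise preference likelihood.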

Aligning to thousands of preferences via system message generalization

S Lee, SH Park, S Kim, M Seo - Advances in Neural …, 2025 - proceedings.neurips.cc
Although humans inherently have diverse values, current large language model (LLM)
alignment methods often assume that aligning LLMs with the general public's preferences is …