Iterative reasoning preference optimization

RY Pang, W Yuan, H He, K Cho… - Advances in …, 2025 - proceedings.neurips.cc
Iterative preference optimization methods have recently been shown to perform well for
general instruction tuning tasks, but typically make little improvement on reasoning tasks. In …

Interpretable preferences via multi-objective reward modeling and mixture-of-experts

H Wang, W Xiong, T Xie, H Zhao, T Zhang - arXiv preprint arXiv …, 2024 - arxiv.org
Reinforcement learning from human feedback (RLHF) has emerged as the primary method
for aligning large language models (LLMs) with human preferences. The RLHF process …
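
For context, the RLHF pipeline mentioned in this snippet is usually framed as KL-regularized reward maximization against a learned reward model. A standard formulation (notation mine, not quoted from the paper; $r_\phi$ is the learned reward, $\pi_{\mathrm{ref}}$ the supervised fine-tuned reference policy, $\beta$ the regularization strength):

  \max_{\pi}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\big[ r_\phi(x, y) \big] \;-\; \beta\, \mathrm{KL}\big( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)

Multi-objective reward modeling, as the title suggests, replaces the single scalar $r_\phi$ with several interpretable objectives that are then aggregated.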

Self-play preference optimization for language model alignment

Y Wu, Z Sun, H Yuan, K Ji, Y Yang, Q Gu - arXiv preprint arXiv:2405.00675, 2024 - arxiv.org
Standard reinforcement learning from human feedback (RLHF) approaches relying on
parametric models like the Bradley-Terry model fall short in capturing the intransitivity and …
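
For reference, the Bradley-Terry model named here scores pairwise preferences through a single pointwise reward $r(x, y)$ (standard form, notation mine, not quoted from the paper):

  P(y_1 \succ y_2 \mid x) \;=\; \frac{\exp r(x, y_1)}{\exp r(x, y_1) + \exp r(x, y_2)} \;=\; \sigma\big( r(x, y_1) - r(x, y_2) \big)

Because every response is reduced to one scalar score, the induced preference relation is always transitive, which is the limitation the abstract alludes to.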

Smaug: Fixing failure modes of preference optimisation with DPO-Positive

A Pal, D Karkhanis, S Dooley, M Roberts… - arXiv preprint arXiv …, 2024 - arxiv.org
Direct Preference Optimisation (DPO) is effective at significantly improving the performance
of large language models (LLMs) on downstream tasks such as reasoning, summarisation …
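
For reference, the standard DPO objective that DPO-Positive builds on is (generic form, notation mine; $y_w$/$y_l$ are the chosen/rejected responses, $\pi_{\mathrm{ref}}$ the reference policy, $\beta$ a temperature):

  \mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]

Since only the margin between the two log-ratios is constrained, the likelihood of the preferred response can itself decrease during training; that is the kind of failure mode the DPO-Positive modification targets.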

Preference fine-tuning of LLMs should leverage suboptimal, on-policy data

F Tajwar, A Singh, A Sharma, R Rafailov… - arXiv preprint arXiv …, 2024 - arxiv.org
Learning from preference labels plays a crucial role in fine-tuning large language models.
There are several distinct approaches for preference fine-tuning, including supervised …

Token-level direct preference optimization

Y Zeng, G Liu, W Ma, N Yang, H Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Fine-tuning pre-trained Large Language Models (LLMs) is essential to align them with
human values and intentions. This process often utilizes methods like pairwise comparisons …

Provably mitigating overoptimization in RLHF: Your SFT loss is implicitly an adversarial regularizer

Z Liu, M Lu, S Zhang, B Liu, H Guo, Y Yang… - arXiv preprint arXiv …, 2024 - arxiv.org
Aligning generative models with human preference via RLHF typically suffers from
overoptimization, where an imperfectly learned reward model can misguide the generative …

DPO meets PPO: Reinforced token optimization for RLHF

H Zhong, G Feng, W Xiong, X Cheng, L Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
In the classical Reinforcement Learning from Human Feedback (RLHF) framework, Proximal
Policy Optimization (PPO) is employed to learn from sparse, sentence-level rewards--a …
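
For context, the sparse sentence-level reward referred to here is commonly implemented in PPO-based RLHF by assigning the learned reward only at the final token and adding a per-token KL penalty (a standard construction, notation mine, not quoted from the paper; $r_\phi$ is the sequence-level reward model, $T$ the final position):

  r_t \;=\; -\,\beta \log \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\mathrm{ref}}(y_t \mid x, y_{<t})} \;+\; \mathbb{1}[t = T]\, r_\phi(x, y)

Under this shaping, informative reward signal arrives only at the end of the sequence, which motivates token-level reward formulations such as the one proposed in this paper.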

Self-exploring language models: Active preference elicitation for online alignment

S Zhang, D Yu, H Sharma, H Zhong, Z Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
Preference optimization, particularly through Reinforcement Learning from Human
Feedback (RLHF), has achieved significant success in aligning Large Language Models …