KTO: Model alignment as prospect theoretic optimization

K Ethayarajh, W Xu, N Muennighoff, D Jurafsky… - arXiv preprint arXiv …, 2024 - arxiv.org
Kahneman & Tversky's $\textit{prospect theory}$ tells us that humans perceive random
variables in a biased but well-defined manner (1992); for example, humans are famously …

RL on incorrect synthetic data scales the efficiency of LLM math reasoning by eight-fold

A Setlur, S Garg, X Geng, N Garg… - Advances in Neural …, 2025 - proceedings.neurips.cc
Training on model-generated synthetic data is a promising approach for finetuning LLMs,
but it remains unclear when it helps or hurts. In this paper, we investigate this question for …

Interpretable preferences via multi-objective reward modeling and mixture-of-experts

H Wang, W Xiong, T Xie, H Zhao, T Zhang - arXiv preprint arXiv …, 2024 - arxiv.org
Reinforcement learning from human feedback (RLHF) has emerged as the primary method
for aligning large language models (LLMs) with human preferences. The RLHF process …

Self-play preference optimization for language model alignment

Y Wu, Z Sun, H Yuan, K Ji, Y Yang, Q Gu - arXiv preprint arXiv:2405.00675, 2024 - arxiv.org
Standard reinforcement learning from human feedback (RLHF) approaches relying on
parametric models like the Bradley-Terry model fall short in capturing the intransitivity and …

Direct Nash optimization: Teaching language models to self-improve with general preferences

C Rosset, CA Cheng, A Mitra, M Santacroce… - arXiv preprint arXiv …, 2024 - arxiv.org
This paper studies post-training large language models (LLMs) using preference feedback
from a powerful oracle to help a model iteratively improve over itself. The typical approach …

A survey of reinforcement learning from human feedback

T Kaufmann, P Weng, V Bengs… - arXiv preprint arXiv …, 2023 - researchgate.net
Reinforcement learning from human feedback (RLHF) is a variant of reinforcement learning
(RL) that learns from human feedback instead of relying on an engineered reward function …

REBEL: Reinforcement learning via regressing relative rewards

Z Gao, J Chang, W Zhan, O Oertell… - Advances in …, 2025 - proceedings.neurips.cc
While originally developed for continuous control problems, Proximal Policy Optimization
(PPO) has emerged as the work-horse of a variety of reinforcement learning (RL) …

Arithmetic control of LLMs for diverse user preferences: Directional preference alignment with multi-objective rewards

H Wang, Y Lin, W Xiong, R Yang, S Diao, S Qiu… - arXiv preprint arXiv …, 2024 - arxiv.org
Fine-grained control over large language models (LLMs) remains a significant challenge,
hindering their adaptability to diverse user needs. While Reinforcement Learning from …

Generalized preference optimization: A unified approach to offline alignment

Y Tang, ZD Guo, Z Zheng, D Calandriello… - arXiv preprint arXiv …, 2024 - arxiv.org
Offline preference optimization allows fine-tuning large models directly from offline data, and
has proved effective in recent alignment practices. We propose generalized preference …