SimPO: Simple preference optimization with a reference-free reward

Y Meng, M Xia, D Chen - Advances in Neural Information …, 2025 - proceedings.neurips.cc
Direct Preference Optimization (DPO) is a widely used offline preference
optimization algorithm that reparameterizes reward functions in reinforcement learning from …

Interpretable preferences via multi-objective reward modeling and mixture-of-experts

H Wang, W Xiong, T Xie, H Zhao, T Zhang - arXiv preprint arXiv …, 2024 - arxiv.org
Reinforcement learning from human feedback (RLHF) has emerged as the primary method
for aligning large language models (LLMs) with human preferences. The RLHF process …

FLAME: Factuality-aware alignment for large language models

SC Lin, L Gao, B Oguz, W Xiong… - Advances in Neural …, 2025 - proceedings.neurips.cc
Alignment is a procedure to fine-tune pre-trained large language models (LLMs) to follow
natural language instructions and serve as helpful AI assistants. We have observed …

Length-controlled AlpacaEval: A simple way to debias automatic evaluators

Y Dubois, B Galambosi, P Liang… - arXiv preprint arXiv …, 2024 - arxiv.org
LLM-based auto-annotators have become a key component of the LLM development
process due to their cost-effectiveness and scalability compared to human-based …

Arithmetic control of LLMs for diverse user preferences: Directional preference alignment with multi-objective rewards

H Wang, Y Lin, W Xiong, R Yang, S Diao, S Qiu… - arXiv preprint arXiv …, 2024 - arxiv.org
Fine-grained control over large language models (LLMs) remains a significant challenge,
hindering their adaptability to diverse user needs. While Reinforcement Learning from …

Length-controlled AlpacaEval: A simple debiasing of automatic evaluators

Y Dubois, P Liang, T Hashimoto - First Conference on Language …, 2024 - openreview.net
LLM-based auto-annotators have become a key component of the LLM development
process due to their cost-effectiveness and scalability compared to human-based …

Uncertainty-aware reward model: Teaching reward models to know what is unknown

X Lou, D Yan, W Shen, Y Yan, J Xie… - arXiv preprint arXiv …, 2024 - arxiv.org
Reward models (RM) play a critical role in aligning generations of large language models
(LLM) to human expectations. However, prevailing RMs fail to capture the stochasticity …

InfoRM: Mitigating reward hacking in RLHF via information-theoretic reward modeling

Y Miao, S Zhang, L Ding, R Bao… - Advances in Neural …, 2025 - proceedings.neurips.cc
Despite the success of reinforcement learning from human feedback (RLHF) in aligning
language models with human values, reward hacking, also termed reward overoptimization …

Self-generated critiques boost reward modeling for language models

Y Yu, Z Chen, A Zhang, L Tan, C Zhu, RY Pang… - arXiv preprint arXiv …, 2024 - arxiv.org
Reward modeling is crucial for aligning large language models (LLMs) with human
preferences, especially in reinforcement learning from human feedback (RLHF). However …

On the algorithmic bias of aligning large language models with RLHF: Preference collapse and matching regularization

J Xiao, Z Li, X Xie, E Getzen, C Fang, Q Long… - arXiv preprint arXiv …, 2024 - arxiv.org
Accurately aligning large language models (LLMs) with human preferences is crucial for
informing fair, economically sound, and statistically efficient decision-making processes …