Provably mitigating overoptimization in RLHF: Your SFT loss is implicitly an adversarial regularizer

Z Liu, M Lu, S Zhang, B Liu, H Guo, Y Yang… - arxiv preprint arxiv …, 2024 - arxiv.org

Helping or herding? Reward model ensembles mitigate but do not eliminate reward hacking

J Eisenstein, C Nagpal, A Agarwal, A Beirami… - arxiv preprint arxiv …, 2023 - arxiv.org
Reward models play a key role in aligning language model applications towards human
preferences. However, this setup creates an incentive for the language model to exploit …
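
The snippet above gives the motivation; as a generic illustration of the ensemble idea (not necessarily the paper's exact aggregation protocol), a conservative combination such as the ensemble minimum or a mean-minus-std lower bound is one common way to make an ensemble of reward models harder for the policy to exploit. The reward callables and the weight beta below are hypothetical stand-ins.

```python
# Hypothetical sketch: conservative aggregation over a reward-model ensemble.
from typing import Callable, List

import numpy as np

RewardFn = Callable[[str, str], float]  # (prompt, response) -> scalar reward


def ensemble_reward(
    reward_models: List[RewardFn],
    prompt: str,
    response: str,
    mode: str = "min",
    beta: float = 1.0,
) -> float:
    """Score a response with an ensemble of reward models.

    mode="min"  : worst-case member (most pessimistic).
    mode="lcb"  : mean minus beta * std (uncertainty-penalized lower bound).
    mode="mean" : plain average (no pessimism, shown for comparison).
    """
    scores = np.array([rm(prompt, response) for rm in reward_models])
    if mode == "min":
        return float(scores.min())
    if mode == "lcb":
        return float(scores.mean() - beta * scores.std())
    return float(scores.mean())


# Usage with toy reward models (stand-ins for fine-tuned scorers):
rms = [lambda p, r, w=w: w * len(r) for w in (0.9, 1.0, 1.1)]
print(ensemble_reward(rms, "prompt", "some response", mode="lcb"))
```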

Regularizing hidden states enables learning generalizable reward model for LLMs

R Yang, R Ding, Y Lin, H Zhang, T Zhang - arxiv preprint arxiv …, 2024 - arxiv.org
Reward models trained on human preference data have been proven to effectively align
Large Language Models (LLMs) with human intent within the framework of reinforcement …
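
A minimal sketch of one way to read the title, assuming a shared causal-LM backbone with an added scalar reward head: train the reward head with a Bradley-Terry preference loss while an auxiliary language-modeling loss on the same hidden states keeps them close to the text-generation distribution. The module interfaces and the weight lambda_lm are assumptions for illustration, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F


def regularized_reward_loss(backbone, reward_head, lm_head,
                            chosen_ids, rejected_ids, lambda_lm=0.1):
    # Hidden states from the shared backbone (assumed HF-style output).
    h_chosen = backbone(chosen_ids).last_hidden_state      # [B, T, d]
    h_rejected = backbone(rejected_ids).last_hidden_state  # [B, T, d]

    r_chosen = reward_head(h_chosen[:, -1]).squeeze(-1)    # [B]
    r_rejected = reward_head(h_rejected[:, -1]).squeeze(-1)

    # Bradley-Terry preference loss: chosen should outscore rejected.
    bt_loss = -F.logsigmoid(r_chosen - r_rejected).mean()

    # Auxiliary next-token loss on the chosen sequence keeps hidden states
    # compatible with the original LM head (the regularizer; padding masking
    # omitted in this sketch).
    logits = lm_head(h_chosen[:, :-1])                      # [B, T-1, V]
    lm_loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        chosen_ids[:, 1:].reshape(-1),
    )
    return bt_loss + lambda_lm * lm_loss
```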

Inverse-RLignment: Inverse reinforcement learning from demonstrations for LLM alignment

H Sun, M van der Schaar - arxiv preprint arxiv:2405.15624, 2024 - arxiv.org
Aligning Large Language Models (LLMs) is crucial for enhancing their safety and utility.
However, existing methods, primarily based on preference datasets, face challenges such …

Correcting the mythos of KL-regularization: Direct alignment without overoptimization via chi-squared preference optimization

A Huang, W Zhan, T Xie, JD Lee, W Sun… - arxiv preprint arxiv …, 2024 - arxiv.org
Language model alignment methods, such as reinforcement learning from human feedback
(RLHF), have led to impressive advances in language model capabilities, but existing …
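
For context, a DPO-style preference loss can be written with a pluggable link function applied to the policy-to-reference likelihood ratio: link(z) = log z recovers standard (KL-regularized) DPO, while a heavier-tailed link such as z + log z is in the spirit of the chi-squared regularization named in the title. The exact form used by chi-squared preference optimization should be taken from the paper itself; the code below is only a sketch.

```python
import torch
import torch.nn.functional as F


def preference_loss(logp_chosen, logp_rejected,
                    ref_logp_chosen, ref_logp_rejected,
                    beta=0.1, link="log"):
    # Likelihood ratios pi_theta / pi_ref for chosen and rejected responses.
    ratio_c = torch.exp(logp_chosen - ref_logp_chosen)
    ratio_r = torch.exp(logp_rejected - ref_logp_rejected)

    if link == "log":              # standard DPO implicit reward
        phi = torch.log
    else:                          # assumed chi-squared-style link: z + log z
        phi = lambda z: z + torch.log(z)

    margin = beta * (phi(ratio_c) - phi(ratio_r))
    return -F.logsigmoid(margin).mean()


# Toy usage with per-sequence log-probabilities (batch of 2):
lp_c = torch.tensor([-10.0, -12.0]); lp_r = torch.tensor([-11.0, -11.5])
rlp_c = torch.tensor([-10.5, -12.5]); rlp_r = torch.tensor([-10.8, -11.0])
print(preference_loss(lp_c, lp_r, rlp_c, rlp_r, link="chi2"))
```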

On Uncertainty In Natural Language Processing

D Ulmer - arxiv preprint arxiv:2410.03446, 2024 - arxiv.org
The last decade in deep learning has brought on increasingly capable systems that are
deployed on a wide variety of applications. In natural language processing, the field has …

Reusing Embeddings: Reproducible Reward Model Research in Large Language Model Alignment without GPUs

H Sun, Y Shen, JF Ton, M van der Schaar - arxiv preprint arxiv …, 2025 - arxiv.org
Large Language Models (LLMs) have made substantial strides in structured tasks through
Reinforcement Learning (RL), demonstrating proficiency in mathematical reasoning and …
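
A CPU-only sketch of the embedding-reuse idea, under the assumption of a linear reward head on cached response embeddings: the Bradley-Terry objective then reduces to logistic regression on embedding differences, so reward-model experiments need no GPU. The data here is synthetic and the shapes are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_pairs, dim = 1000, 256                      # assumed sizes

# Stand-ins for embeddings precomputed once with a frozen LLM encoder.
emb_a = rng.normal(size=(n_pairs, dim))
emb_b = rng.normal(size=(n_pairs, dim))
w_true = rng.normal(size=dim)                 # hidden "true" preference direction
a_preferred = (emb_a - emb_b) @ w_true > 0
emb_chosen = np.where(a_preferred[:, None], emb_a, emb_b)
emb_rejected = np.where(a_preferred[:, None], emb_b, emb_a)

# Bradley-Terry with a linear head: P(chosen > rejected) = sigmoid(w . (e_c - e_r)),
# i.e. logistic regression on embedding differences (both orderings give 2 classes).
diffs = emb_chosen - emb_rejected
X = np.vstack([diffs, -diffs])
y = np.concatenate([np.ones(n_pairs), np.zeros(n_pairs)])
clf = LogisticRegression(fit_intercept=False, max_iter=1000).fit(X, y)

w = clf.coef_.ravel()                         # learned reward direction
print("pairwise accuracy:", float((diffs @ w > 0).mean()))
```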

Reviving The Classics: Active Reward Modeling in Large Language Model Alignment

Y Shen, H Sun, JF Ton - arxiv preprint arxiv:2502.04354, 2025 - arxiv.org
Building neural reward models from human preferences is a pivotal component in
reinforcement learning from human feedback (RLHF) and large language model alignment …
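
As a generic illustration of active data selection (a stand-in heuristic, not necessarily the selection rule studied in the paper above), one classical choice is to spend the annotation budget on the response pairs that a small reward-model ensemble disagrees about the most.

```python
import numpy as np


def select_pairs_by_disagreement(candidate_pairs, reward_models, budget=8):
    """candidate_pairs: list of (prompt, response_a, response_b).
    reward_models: list of callables (prompt, response) -> float.
    Returns the `budget` pairs whose score margin the ensemble disagrees
    about the most (highest variance), as candidates for human labeling."""
    margins = []
    for prompt, a, b in candidate_pairs:
        diffs = [rm(prompt, a) - rm(prompt, b) for rm in reward_models]
        margins.append(np.var(diffs))
    ranked = np.argsort(margins)[::-1][:budget]
    return [candidate_pairs[i] for i in ranked]
```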

Sail into the Headwind: Alignment via Robust Rewards and Dynamic Labels against Reward Hacking

P Rashidinejad, Y Tian - arxiv preprint arxiv:2412.09544, 2024 - arxiv.org
Aligning AI systems with human preferences typically suffers from the infamous reward
hacking problem, where optimization of an imperfect reward model leads to undesired …

The Energy Loss Phenomenon in RLHF: A New Perspective on Mitigating Reward Hacking

Y Miao, S Zhang, L Ding, Y Zhang, L Zhang… - arxiv preprint arxiv …, 2025 - arxiv.org
This work identifies the Energy Loss Phenomenon in Reinforcement Learning from Human
Feedback (RLHF) and its connection to reward hacking. Specifically, energy loss in the final …
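
Based only on the truncated snippet above, a hedged sketch of how a final-layer "energy" statistic might be monitored during RLHF: the norm choice, the comparison against a reference model, and the penalty coefficient are assumptions for illustration, not the paper's definitions.

```python
import torch


def final_layer_energy(hidden_states: torch.Tensor) -> torch.Tensor:
    """hidden_states: [batch, seq_len, d] from the model's last layer.
    Returns one scalar per sequence: mean L1 norm of token representations."""
    return hidden_states.abs().sum(dim=-1).mean(dim=-1)


def energy_penalty(policy_hidden, ref_hidden, coef=0.01):
    """Penalize drift of the policy's final-layer energy away from the
    reference model's on the same responses (an assumed regularizer that
    could be logged or added to the RLHF objective)."""
    gap = final_layer_energy(policy_hidden) - final_layer_energy(ref_hidden)
    return coef * gap.abs().mean()
```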