WARM: On the benefits of weight averaged reward models

A Ramé, N Vieillard, L Hussenot, R Dadashi… - arXiv preprint arXiv …, 2024 - arxiv.org
Aligning large language models (LLMs) with human preferences through reinforcement
learning (RLHF) can lead to reward hacking, where LLMs exploit failures in the reward …
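As a rough illustration only (not the paper's actual recipe), the "weight averaged reward models" in the title generally means averaging the parameters of several reward models fine-tuned from a shared initialization; a minimal sketch over PyTorch-style state dicts, with the helper name and toy checkpoints invented here:

```python
from collections import OrderedDict

import torch


def average_state_dicts(state_dicts, weights=None):
    """Average parameter tensors that share keys and shapes (uniform by default)."""
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    averaged = OrderedDict()
    for key in state_dicts[0]:
        averaged[key] = sum(w * sd[key].float() for w, sd in zip(weights, state_dicts))
    return averaged


# Toy demo: two "reward model checkpoints" reduced to a single tensor each.
sd_a = {"score.weight": torch.tensor([1.0, 3.0])}
sd_b = {"score.weight": torch.tensor([3.0, 1.0])}
print(average_state_dicts([sd_a, sd_b]))  # score.weight averaged to tensor([2., 2.])
```

In practice the averaged state dict would be loaded back into a single reward model (e.g. via `load_state_dict`) and used as the scoring model during RLHF.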

Understanding and improving feature learning for out-of-distribution generalization

Y Chen, W Huang, K Zhou, Y Bian… - Advances in Neural …, 2024 - proceedings.neurips.cc
A common explanation for the failure of out-of-distribution (OOD) generalization is that the
model trained with empirical risk minimization (ERM) learns spurious features instead of …

Mitigating the alignment tax of RLHF

Y Lin, H Lin, W Xiong, S Diao, J Liu… - Proceedings of the …, 2024 - aclanthology.org
LLMs acquire a wide range of abilities during pre-training, but aligning LLMs under
Reinforcement Learning with Human Feedback (RLHF) can lead to forgetting pretrained …

Towards stable backdoor purification through feature shift tuning

R Min, Z Qin, L Shen, M Cheng - Advances in Neural …, 2024 - proceedings.neurips.cc
It has been widely observed that deep neural networks (DNNs) are vulnerable to backdoor
attacks where attackers could manipulate the model behavior maliciously by tampering with …

WARP: On the benefits of weight averaged rewarded policies

A Ramé, J Ferret, N Vieillard, R Dadashi… - arXiv preprint arXiv …, 2024 - arxiv.org
Reinforcement learning from human feedback (RLHF) aligns large language models (LLMs)
by encouraging their generations to have high rewards, using a reward model trained on …

On the limited generalization capability of the implicit reward model induced by direct preference optimization

Y Lin, S Seto, M Ter Hoeve, K Metcalf… - arXiv preprint arXiv …, 2024 - arxiv.org
Reinforcement Learning from Human Feedback (RLHF) is an effective approach for aligning
language models to human preferences. Central to RLHF is learning a reward function for …

DaWin: Training-free dynamic weight interpolation for robust adaptation

C Oh, Y Li, K Song, S Yun, D Han - arXiv preprint arXiv:2410.03782, 2024 - arxiv.org
Adapting a pre-trained foundation model on downstream tasks should ensure robustness
against distribution shifts without the need to retrain the whole model. Although existing …
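The snippet cuts off before the method; as a rough, training-free illustration of per-input weight interpolation between a pretrained and a fine-tuned model of the same architecture (PyTorch assumed; the entropy-based mixing coefficient and the function name are this sketch's own choices, not necessarily DaWin's exact rule):

```python
import copy

import torch
import torch.nn.functional as F


def entropy(logits):
    """Shannon entropy of the softmax distribution, one value per sample."""
    log_probs = F.log_softmax(logits, dim=-1)
    return -(log_probs.exp() * log_probs).sum(dim=-1)


@torch.no_grad()
def dynamic_interp_predict(pretrained, finetuned, x):
    """Blend the two models' weights per input, leaning toward the more confident one."""
    h_pre = entropy(pretrained(x))
    h_ft = entropy(finetuned(x))
    lam = h_pre / (h_pre + h_ft + 1e-12)  # per-sample weight on the fine-tuned model

    sd_pre, sd_ft = pretrained.state_dict(), finetuned.state_dict()
    outputs = []
    for i in range(x.shape[0]):
        merged = copy.deepcopy(pretrained)
        merged.load_state_dict(
            {k: (1 - lam[i]) * sd_pre[k] + lam[i] * sd_ft[k] for k in sd_pre}
        )
        outputs.append(merged(x[i : i + 1]))
    return torch.cat(outputs, dim=0)


# Toy demo with two linear "models" sharing one architecture.
pre, ft = torch.nn.Linear(4, 3), torch.nn.Linear(4, 3)
print(dynamic_interp_predict(pre, ft, torch.randn(2, 4)).shape)  # torch.Size([2, 3])
```

No gradients or retraining are involved; the only per-input work is computing the mixing coefficient and merging the two sets of weights.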

Continuous Invariance Learning

Y Lin, F Zhou, L Tan, L Ma, J Liu, Y He, Y Yuan… - arXiv preprint arXiv …, 2023 - arxiv.org
Invariance learning methods aim to learn invariant features in the hope that they generalize
under distributional shifts. Although many tasks are naturally characterized by continuous …

Discovering environments with XRM

M Pezeshki, D Bouchacourt, M Ibrahim… - arXiv preprint arXiv …, 2023 - arxiv.org
Successful out-of-distribution generalization requires environment annotations.
Unfortunately, these are resource-intensive to obtain, and their relevance to model …

Even small correlation and diversity shifts pose dataset-bias issues

A Bissoto, C Barata, E Valle, S Avila - Pattern Recognition Letters, 2024 - Elsevier
Distribution shifts hinder the deployment of deep learning in real-world problems.
Distribution shifts appear when train and test data come from different sources, which …