WARM: On the benefits of weight averaged reward models
Aligning large language models (LLMs) with human preferences through reinforcement
learning (RLHF) can lead to reward hacking, where LLMs exploit failures in the reward …
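The title's "weight averaged reward models" refers to averaging the parameters of several reward models that share an architecture, typically fine-tuned from a common initialization. A minimal sketch of generic parameter averaging follows; the helper name is illustrative and this is not the paper's own code.

```python
# Minimal sketch: average the parameters of reward models that share an
# architecture (e.g., fine-tuned from the same initialization with different
# seeds). Illustrative only; not the WARM authors' implementation.
import torch
import torch.nn as nn


def average_state_dicts(models: list[nn.Module]) -> dict:
    """Elementwise mean of the models' floating-point parameters/buffers."""
    state_dicts = [m.state_dict() for m in models]
    averaged = {}
    for key in state_dicts[0]:
        tensors = [sd[key] for sd in state_dicts]
        if torch.is_floating_point(tensors[0]):
            averaged[key] = torch.stack(tensors, dim=0).mean(dim=0)
        else:
            # Non-float buffers (e.g., step counters) cannot be averaged; keep one copy.
            averaged[key] = tensors[0].clone()
    return averaged


# Usage: merged_model.load_state_dict(average_state_dicts([rm1, rm2, rm3]))
```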
Understanding and improving feature learning for out-of-distribution generalization
A common explanation for the failure of out-of-distribution (OOD) generalization is that the
model trained with empirical risk minimization (ERM) learns spurious features instead of …
Mitigating the alignment tax of RLHF
LLMs acquire a wide range of abilities during pre-training, but aligning LLMs under
Reinforcement Learning with Human Feedback (RLHF) can lead to forgetting pretrained …
Towards stable backdoor purification through feature shift tuning
It has been widely observed that deep neural networks (DNNs) are vulnerable to backdoor
attacks where attackers could manipulate the model behavior maliciously by tampering with …
WARP: On the benefits of weight averaged rewarded policies
Reinforcement learning from human feedback (RLHF) aligns large language models (LLMs)
by encouraging their generations to have high rewards, using a reward model trained on …
On the limited generalization capability of the implicit reward model induced by direct preference optimization
Reinforcement Learning from Human Feedback (RLHF) is an effective approach for aligning
language models to human preferences. Central to RLHF is learning a reward function for …
DaWin: Training-free dynamic weight interpolation for robust adaptation
Adapting a pre-trained foundation model on downstream tasks should ensure robustness
against distribution shifts without the need to retrain the whole model. Although existing …
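Here, "weight interpolation" means blending the parameters of a pre-trained and a fine-tuned model of the same architecture. The sketch below shows the generic static form with a fixed coefficient; how that coefficient is chosen per input, without training, is the paper's contribution and is not reproduced. The helper name is hypothetical.

```python
# Minimal sketch: interpolate between a pre-trained and a fine-tuned model's
# parameters with a fixed coefficient `lam` in [0, 1]. Illustrative only.
import torch
import torch.nn as nn


def interpolate_weights(pretrained: nn.Module, finetuned: nn.Module, lam: float) -> dict:
    """Return (1 - lam) * pretrained + lam * finetuned for float tensors."""
    sd_pre, sd_ft = pretrained.state_dict(), finetuned.state_dict()
    merged = {}
    for key in sd_pre:
        if torch.is_floating_point(sd_pre[key]):
            merged[key] = (1.0 - lam) * sd_pre[key] + lam * sd_ft[key]
        else:
            # Keep non-float buffers from the fine-tuned model as-is.
            merged[key] = sd_ft[key].clone()
    return merged
```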
Continuous Invariance Learning
Invariance learning methods aim to learn invariant features in the hope that they generalize
under distributional shifts. Although many tasks are naturally characterized by continuous …
Discovering environments with XRM
Successful out-of-distribution generalization requires environment annotations.
Unfortunately, these are resource-intensive to obtain, and their relevance to model …
Even small correlation and diversity shifts pose dataset-bias issues
Distribution shifts hinder the deployment of deep learning in real-world problems.
Distribution shifts appear when train and test data come from different sources, which …