WARM: On the benefits of weight averaged reward models

A Ramé, N Vieillard, L Hussenot, R Dadashi… - arXiv preprint arXiv …, 2024 - arxiv.org
Aligning large language models (LLMs) with human preferences through reinforcement
learning (RLHF) can lead to reward hacking, where LLMs exploit failures in the reward …
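As a rough illustration only (not the paper's actual recipe), the "weight averaged reward models" in the title generally means averaging the parameters of several reward models fine-tuned from a shared initialization; a minimal sketch over PyTorch-style state dicts, with the helper name and toy checkpoints invented here:

```python
from collections import OrderedDict

import torch


def average_state_dicts(state_dicts, weights=None):
    """Average parameter tensors that share keys and shapes (uniform by default)."""
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    averaged = OrderedDict()
    for key in state_dicts[0]:
        averaged[key] = sum(w * sd[key].float() for w, sd in zip(weights, state_dicts))
    return averaged


# Toy demo: two "reward model checkpoints" reduced to a single tensor each.
sd_a = {"score.weight": torch.tensor([1.0, 3.0])}
sd_b = {"score.weight": torch.tensor([3.0, 1.0])}
print(average_state_dicts([sd_a, sd_b]))  # score.weight averaged to tensor([2., 2.])
```

In practice the averaged state dict would be loaded back into a single reward model (e.g. via `load_state_dict`) and used as the scoring model during RLHF.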

Understanding and improving feature learning for out-of-distribution generalization

Y Chen, W Huang, K Zhou, Y Bian… - Advances in Neural …, 2024 - proceedings.neurips.cc
A common explanation for the failure of out-of-distribution (OOD) generalization is that the
model trained with empirical risk minimization (ERM) learns spurious features instead of …

Mitigating the alignment tax of RLHF

Y Lin, H Lin, W Xiong, S Diao, J Liu… - Proceedings of the …, 2024 - aclanthology.org
LLMs acquire a wide range of abilities during pre-training, but aligning LLMs under
Reinforcement Learning with Human Feedback (RLHF) can lead to forgetting pretrained …

Towards stable backdoor purification through feature shift tuning

R Min, Z Qin, L Shen, M Cheng - Advances in Neural …, 2024 - proceedings.neurips.cc
It has been widely observed that deep neural networks (DNNs) are vulnerable to backdoor
attacks where attackers could manipulate the model behavior maliciously by tampering with …

WARP: On the benefits of weight averaged rewarded policies

A Ramé, J Ferret, N Vieillard, R Dadashi… - arXiv preprint arXiv …, 2024 - arxiv.org
Reinforcement learning from human feedback (RLHF) aligns large language models (LLMs)
by encouraging their generations to have high rewards, using a reward model trained on …

On the limited generalization capability of the implicit reward model induced by direct preference optimization

Y Lin, S Seto, M Ter Hoeve, K Metcalf… - arXiv preprint arXiv …, 2024 - arxiv.org
Reinforcement Learning from Human Feedback (RLHF) is an effective approach for aligning
language models to human preferences. Central to RLHF is learning a reward function for …

DaWin: Training-free dynamic weight interpolation for robust adaptation

C Oh, Y Li, K Song, S Yun, D Han - arXiv preprint arXiv:2410.03782, 2024 - arxiv.org
Adapting a pre-trained foundation model on downstream tasks should ensure robustness
against distribution shifts without the need to retrain the whole model. Although existing …
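The snippet cuts off before the method; as a rough, training-free illustration of per-input weight interpolation between a pretrained and a fine-tuned model of the same architecture (PyTorch assumed; the entropy-based mixing coefficient and the function name are this sketch's own choices, not necessarily DaWin's exact rule):

```python
import copy

import torch
import torch.nn.functional as F


def entropy(logits):
    """Shannon entropy of the softmax distribution, one value per sample."""
    log_probs = F.log_softmax(logits, dim=-1)
    return -(log_probs.exp() * log_probs).sum(dim=-1)


@torch.no_grad()
def dynamic_interp_predict(pretrained, finetuned, x):
    """Blend the two models' weights per input, leaning toward the more confident one."""
    h_pre = entropy(pretrained(x))
    h_ft = entropy(finetuned(x))
    lam = h_pre / (h_pre + h_ft + 1e-12)  # per-sample weight on the fine-tuned model

    sd_pre, sd_ft = pretrained.state_dict(), finetuned.state_dict()
    outputs = []
    for i in range(x.shape[0]):
        merged = copy.deepcopy(pretrained)
        merged.load_state_dict(
            {k: (1 - lam[i]) * sd_pre[k] + lam[i] * sd_ft[k] for k in sd_pre}
        )
        outputs.append(merged(x[i : i + 1]))
    return torch.cat(outputs, dim=0)


# Toy demo with two linear "models" sharing one architecture.
pre, ft = torch.nn.Linear(4, 3), torch.nn.Linear(4, 3)
print(dynamic_interp_predict(pre, ft, torch.randn(2, 4)).shape)  # torch.Size([2, 3])
```

No gradients or retraining are involved; the only per-input work is computing the mixing coefficient and merging the two sets of weights.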

Continuous Invariance Learning

Y Lin, F Zhou, L Tan, L Ma, J Liu, Y He, Y Yuan… - arXiv preprint arXiv …, 2023 - arxiv.org
Invariance learning methods aim to learn invariant features in the hope that they generalize
under distributional shifts. Although many tasks are naturally characterized by continuous …

Discovering environments with XRM

M Pezeshki, D Bouchacourt, M Ibrahim… - arXiv preprint arXiv …, 2023 - arxiv.org
Successful out-of-distribution generalization requires environment annotations.
Unfortunately, these are resource-intensive to obtain, and their relevance to model …

Even small correlation and diversity shifts pose dataset-bias issues

A Bissoto, C Barata, E Valle, S Avila - Pattern Recognition Letters, 2024 - Elsevier
Distribution shifts hinder the deployment of deep learning in real-world problems.
Distribution shifts appear when train and test data come from different sources, which …