Self-generated critiques boost reward modeling for language models

Y Yu, Z Chen, A Zhang, L Tan, C Zhu, RY Pang… - arXiv preprint arXiv …, 2024 - arxiv.org
Reward modeling is crucial for aligning large language models (LLMs) with human
preferences, especially in reinforcement learning from human feedback (RLHF). However …

RRM: Robust reward model training mitigates reward hacking

T Liu, W Xiong, J Ren, L Chen, J Wu, R Joshi… - arXiv preprint arXiv …, 2024 - arxiv.org
Reward models (RMs) play a pivotal role in aligning large language models (LLMs) with
human preferences. However, traditional RM training, which relies on response pairs tied to …
