DPO meets PPO: Reinforced token optimization for RLHF

H Zhong, G Feng, W Xiong, X Cheng, L Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
In the classical Reinforcement Learning from Human Feedback (RLHF) framework, Proximal
Policy Optimization (PPO) is employed to learn from sparse, sentence-level rewards--a …

Decoding-time language model alignment with multiple objectives

R Shi, Y Chen, Y Hu, A Liu, H Hajishirzi… - arXiv preprint arXiv …, 2024 - arxiv.org
Aligning language models (LMs) to human preferences has emerged as a critical pursuit,
enabling these models to better serve diverse user needs. Existing methods primarily focus …

Towards a unified view of preference learning for large language models: A survey

B Gao, F Song, Y Miao, Z Cai, Z Yang, L Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) exhibit remarkably powerful capabilities. One of the crucial
factors to achieve success is aligning the LLM's output with human preferences. This …

Conditional Language Policy: A General Framework for Steerable Multi-Objective Finetuning

K Wang, R Kidambi, R Sullivan, A Agarwal… - arXiv preprint arXiv …, 2024 - arxiv.org
Reward-based finetuning is crucial for aligning language policies with intended behaviors
(e.g., creativity and safety). A key challenge is to develop steerable language models that …

The perfect blend: Redefining RLHF with mixture of judges

T Xu, E Helenowski, KA Sankararaman, D Jin… - arXiv preprint arXiv …, 2024 - arxiv.org
Reinforcement learning from human feedback (RLHF) has become the leading approach for
fine-tuning large language models (LLMs). However, RLHF has limitations in multi-task …

Alignment of diffusion models: Fundamentals, challenges, and future

B Liu, S Shao, B Li, L Bai, Z Xu, H Xiong, J Kwok… - arXiv preprint arXiv …, 2024 - arxiv.org
Diffusion models have emerged as the leading paradigm in generative modeling, excelling
in various applications. Despite their success, these models often misalign with human …

DynaMath: A dynamic visual benchmark for evaluating mathematical reasoning robustness of vision language models

C Zou, X Guo, R Yang, J Zhang, B Hu… - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid advancements in Vision-Language Models (VLMs) have shown great potential in
tackling mathematical reasoning tasks that involve visual context. Unlike humans who can …

Cascade reward sampling for efficient decoding-time alignment

B Li, Y Wang, A Grama, R Zhang - arXiv preprint arXiv:2406.16306, 2024 - arxiv.org
Aligning large language models (LLMs) with human preferences is critical for their
deployment. Recently, decoding-time alignment has emerged as an effective plug-and-play …

Personalization of large language models: A survey

Z Zhang, RA Rossi, B Kveton, Y Shao, D Yang… - arXiv preprint arXiv …, 2024 - arxiv.org
Personalization of Large Language Models (LLMs) has recently become increasingly
important with a wide range of applications. Despite the importance and recent progress …

Inverse-RLignment: Inverse Reinforcement Learning from Demonstrations for LLM Alignment

H Sun, M van der Schaar - arXiv preprint arXiv:2405.15624, 2024 - arxiv.org
Aligning Large Language Models (LLMs) is crucial for enhancing their safety and utility.
However, existing methods, primarily based on preference datasets, face challenges such …