Cascade reward sampling for efficient decoding-time alignment

B Li, Y Wang, A Grama, R Zhang - arXiv preprint arXiv:2406.16306, 2024 - arxiv.org
Aligning large language models (LLMs) with human preferences is critical for their
deployment. Recently, decoding-time alignment has emerged as an effective plug-and-play …

BPO: Towards balanced preference optimization between knowledge breadth and depth in alignment

S Wang, Y Tong, H Zhang, D Li, X Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Reinforcement Learning with Human Feedback (RLHF) is the key to the success of large
language models (LLMs) in recent years. In this work, we first introduce the concepts of …

Intuitive Fine-Tuning: Towards Simplifying Alignment into a Single Process

E Hua, B Qi, K Zhang, Y Yu, N Ding, X Lv… - arXiv preprint arXiv …, 2024 - arxiv.org
Supervised Fine-Tuning (SFT) and Preference Optimization (PO) are two fundamental
processes for enhancing the capabilities of Language Models (LMs) post pre-training …

Refining Alignment Framework for Diffusion Models with Intermediate-Step Preference Ranking

J Ren, Y Zhang, D Liu, X Zhang, Q Tian - arXiv preprint arXiv:2502.01667, 2025 - arxiv.org
Direct preference optimization (DPO) has shown success in aligning diffusion models with
human preference. Previous approaches typically assume a consistent preference label …

Hybrid Preference Optimization: Augmenting Direct Preference Optimization with Auxiliary Objectives

A Badrinath, P Agarwal, J Xu - arXiv preprint arXiv:2405.17956, 2024 - arxiv.org
For aligning large language models (LLMs), prior work has leveraged reinforcement
learning via human feedback (RLHF) or variations of direct preference optimization (DPO) …

Length Desensitization in Direct Preference Optimization

W Liu, Y Bai, C Han, R Weng, J Xu, X Cao… - arXiv preprint arXiv …, 2024 - arxiv.org
Direct Preference Optimization (DPO) is widely utilized in the Reinforcement Learning from
Human Feedback (RLHF) phase to align Large Language Models (LLMs) with human …

The crucial role of samplers in online direct preference optimization

R Shi, R Zhou, SS Du - arXiv preprint arXiv:2409.19605, 2024 - arxiv.org
Direct Preference Optimization (DPO) has emerged as a stable, scalable, and efficient
solution for language model alignment. Despite its empirical success, the …
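
Several of the entries above build on the DPO objective; for orientation, a standard statement of that loss is reproduced here from the original DPO formulation (Rafailov et al., 2023), not from any of the truncated abstracts above. It is a pairwise logistic loss over preferred and dispreferred responses, with $\beta$ controlling the implicit KL regularization toward the reference policy $\pi_{\mathrm{ref}}$:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}})
= -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\!\left[\log\sigma\!\left(
\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)}
-\beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right]
$$

where $y_w$ and $y_l$ are the preferred and dispreferred responses to prompt $x$ drawn from the preference dataset $\mathcal{D}$.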