Cascade reward sampling for efficient decoding-time alignment
Aligning large language models (LLMs) with human preferences is critical for their
deployment. Recently, decoding-time alignment has emerged as an effective plug-and-play …
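As background for the entry above: decoding-time alignment generally steers generation with an external reward signal at inference time rather than by fine-tuning. The sketch below shows a generic best-of-n variant, assuming a Hugging Face-style model/tokenizer and a hypothetical `reward_model` callable; it is an illustrative baseline, not the paper's cascade reward sampling method.

```python
# Generic decoding-time alignment via best-of-n sampling (illustrative only;
# NOT the cascade reward sampling algorithm from the entry above).
# Assumes a Hugging Face-style causal LM and tokenizer, plus a hypothetical
# `reward_model` callable that scores a full prompt+completion string.
import torch


def best_of_n_decode(lm, tokenizer, reward_model, prompt, n=8, max_new_tokens=128):
    """Sample n completions and return the one the reward model scores highest."""
    inputs = tokenizer(prompt, return_tensors="pt").to(lm.device)
    candidates = []
    for _ in range(n):
        output_ids = lm.generate(
            **inputs, do_sample=True, top_p=0.95, max_new_tokens=max_new_tokens
        )
        candidates.append(tokenizer.decode(output_ids[0], skip_special_tokens=True))
    # Score every full candidate with the (external) reward model and keep the best.
    scores = torch.tensor([float(reward_model(c)) for c in candidates])
    return candidates[int(scores.argmax())]
```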
BPO: Towards balanced preference optimization between knowledge breadth and depth in alignment
Reinforcement Learning with Human Feedback (RLHF) is the key to the success of large
language models (LLMs) in recent years. In this work, we first introduce the concepts of …
Intuitive Fine-Tuning: Towards Simplifying Alignment into a Single Process
Supervised Fine-Tuning (SFT) and Preference Optimization (PO) are two fundamental
processes for enhancing the capabilities of Language Models (LMs) post pre-training …
Refining Alignment Framework for Diffusion Models with Intermediate-Step Preference Ranking
J Ren, Y Zhang, D Liu, X Zhang, Q Tian - arXiv preprint arXiv:2502.01667, 2025 - arxiv.org
Direct preference optimization (DPO) has shown success in aligning diffusion models with
human preference. Previous approaches typically assume a consistent preference label …
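For context on the DPO-based entries in this list, the standard direct preference optimization objective (a known formulation from the DPO literature, not specific to any single entry) contrasts a preferred response y_w with a dispreferred one y_l against a frozen reference policy:

\[
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) =
-\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
\left[ \log \sigma\!\left(
\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
- \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\right) \right]
\]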
Hybrid Preference Optimization: Augmenting Direct Preference Optimization with Auxiliary Objectives
For aligning large language models (LLMs), prior work has leveraged reinforcement
learning via human feedback (RLHF) or variations of direct preference optimization (DPO) …
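One common way such auxiliary objectives are attached to DPO (shown here only as a generic pattern; the specific formulation in the entry above may differ) is a weighted sum with, for example, an SFT log-likelihood term on the preferred response:

\[
\mathcal{L}_{\mathrm{hybrid}} = \mathcal{L}_{\mathrm{DPO}} + \lambda\, \mathcal{L}_{\mathrm{aux}},
\qquad
\mathcal{L}_{\mathrm{aux}} = -\,\mathbb{E}_{(x,\, y_w) \sim \mathcal{D}} \big[ \log \pi_\theta(y_w \mid x) \big].
\]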
Length Desensitization in Direct Preference Optimization
Direct Preference Optimization (DPO) is widely utilized in the Reinforcement Learning from
Human Feedback (RLHF) phase to align Large Language Models (LLMs) with human …
The crucial role of samplers in online direct preference optimization
Direct Preference Optimization (DPO) has emerged as a stable, scalable, and efficient
solution for language model alignment. Despite its empirical success, the …