SimPO: Simple preference optimization with a reference-free reward

Y Meng, M Xia, D Chen - Advances in Neural Information …, 2025 - proceedings.neurips.cc
Direct Preference Optimization (DPO) is a widely used offline preference
optimization algorithm that reparameterizes reward functions in reinforcement learning from …
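
The reference-free reward in SimPO is the length-normalized log-likelihood of a response under the policy, compared between the chosen and rejected responses with a target margin. Below is a minimal PyTorch-style sketch of that objective, assuming the inputs are summed token log-probabilities and response lengths; the function name and hyperparameter values are illustrative, not the authors' code.

import torch.nn.functional as F

def simpo_loss(chosen_logps, rejected_logps, chosen_len, rejected_len,
               beta=2.0, gamma=1.0):
    # Implicit reward: average per-token log-probability, scaled by beta.
    chosen_reward = beta * chosen_logps / chosen_len
    rejected_reward = beta * rejected_logps / rejected_len
    # Bradley-Terry-style loss with a target reward margin gamma;
    # no reference model is needed.
    return -F.logsigmoid(chosen_reward - rejected_reward - gamma).mean()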

Large language models are effective text rankers with pairwise ranking prompting

Z Qin, R Jagerman, K Hui, H Zhuang, J Wu… - arXiv preprint arXiv …, 2023 - arxiv.org
Ranking documents using Large Language Models (LLMs) by directly feeding the query and
candidate documents into the prompt is an interesting and practical problem. However …
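
In pairwise ranking prompting, the LLM sees the query together with exactly two candidate documents per prompt and states which one is more relevant; a full ranking is then assembled by aggregating these judgments, for example by counting wins over all pairs. A minimal sketch under that reading follows, where llm_compare is a hypothetical callable (not an API from the paper) that returns 0 or 1 for the preferred candidate.

from itertools import combinations

def pairwise_rank(query, docs, llm_compare):
    # llm_compare(query, doc_a, doc_b) -> 0 or 1: hypothetical LLM call
    # that prompts with the query and two candidates and returns the
    # index of the passage judged more relevant.
    wins = [0] * len(docs)
    for i, j in combinations(range(len(docs)), 2):
        # Ask in both orders to dampen position bias in the prompt.
        wins[i if llm_compare(query, docs[i], docs[j]) == 0 else j] += 1
        wins[j if llm_compare(query, docs[j], docs[i]) == 0 else i] += 1
    # Rank documents by their number of pairwise wins.
    order = sorted(range(len(docs)), key=wins.__getitem__, reverse=True)
    return [docs[k] for k in order]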

Direct nash optimization: Teaching language models to self-improve with general preferences

C Rosset, CA Cheng, A Mitra, M Santacroce… - arXiv preprint arXiv …, 2024 - arxiv.org
This paper studies post-training large language models (LLMs) using preference feedback
from a powerful oracle to help a model iteratively improve over itself. The typical approach …

LLM Comparator: Visual analytics for side-by-side evaluation of large language models

M Kahng, I Tenney, M Pushkarna, MX Liu… - Extended Abstracts of …, 2024 - dl.acm.org
Automatic side-by-side evaluation has emerged as a promising approach to evaluating the
quality of responses from large language models (LLMs). However, analyzing the results …

Building math agents with multi-turn iterative preference learning

W Xiong, C Shi, J Shen, A Rosenberg, Z Qin… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent studies have shown that large language models' (LLMs) mathematical problem-
solving capabilities can be enhanced by integrating external tools, such as code …

A survey on human preference learning for large language models

R Jiang, K Chen, X Bai, Z He, J Li, M Yang… - arXiv preprint arXiv …, 2024 - arxiv.org
The recent surge of versatile large language models (LLMs) largely depends on aligning
increasingly capable foundation models with human intentions by preference learning …

Towards a unified view of preference learning for large language models: A survey

B Gao, F Song, Y Miao, Z Cai, Z Yang, L Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) exhibit remarkably powerful capabilities. One of the crucial
factors to achieve success is aligning the LLM's output with human preferences. This …

Prompt optimization with human feedback

X Lin, Z Dai, A Verma, SK Ng, P Jaillet… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) have demonstrated remarkable performances in various
tasks. However, the performance of LLMs heavily depends on the input prompt, which has …

Alignment of diffusion models: Fundamentals, challenges, and future

B Liu, S Shao, B Li, L Bai, Z Xu, H Xiong, J Kwok… - arXiv preprint arXiv …, 2024 - arxiv.org
Diffusion models have emerged as the leading paradigm in generative modeling, excelling
in various applications. Despite their success, these models often misalign with human …

Filtered direct preference optimization

T Morimura, M Sakamoto, Y Jinnai, K Abe… - arXiv preprint arXiv …, 2024 - arxiv.org
Reinforcement learning from human feedback (RLHF) plays a crucial role in aligning
language models with human preferences. While the significance of dataset quality is …
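
For context, the objective being filtered here is the standard DPO loss, in which the implicit reward is the log-likelihood ratio between the policy and a frozen reference model. A minimal sketch follows, assuming summed token log-probabilities as inputs; the dataset-filtering step the paper adds is not reproduced here.

import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit rewards: beta-scaled log-ratio of policy to reference model.
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between chosen and rejected implicit rewards.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()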