Cambrian-1: A fully open, vision-centric exploration of multimodal LLMs

P Tong, E Brown, P Wu, S Woo… - Advances in …, 2025 - proceedings.neurips.cc
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-
centric approach. While stronger language models can enhance multimodal capabilities, the …

SimPO: Simple preference optimization with a reference-free reward

Y Meng, M Xia, D Chen - Advances in Neural Information …, 2025 - proceedings.neurips.cc
Direct Preference Optimization (DPO) is a widely used offline preference
optimization algorithm that reparameterizes reward functions in reinforcement learning from …
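
For orientation (not taken from the truncated abstract), a minimal sketch of the contrast this entry alludes to, assuming the standard notation with policy pi_theta, reference policy pi_ref, preferred/dispreferred responses y_w, y_l, scaling beta, and target margin gamma:

```latex
% DPO's implicit, reference-dependent reward
r_{\mathrm{DPO}}(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}

% SimPO's reference-free, length-normalized reward and objective (sketch)
r_{\mathrm{SimPO}}(x, y) = \frac{\beta}{|y|} \log \pi_\theta(y \mid x)

\mathcal{L}_{\mathrm{SimPO}} = -\,\mathbb{E}_{(x, y_w, y_l)}
  \left[ \log \sigma\!\left( r_{\mathrm{SimPO}}(x, y_w) - r_{\mathrm{SimPO}}(x, y_l) - \gamma \right) \right]
```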

Iterative preference learning from human feedback: Bridging theory and practice for RLHF under KL-constraint

W Xiong, H Dong, C Ye, Z Wang, H Zhong, H Ji… - arXiv preprint arXiv …, 2023 - arxiv.org
This paper studies the alignment process of generative models with Reinforcement Learning
from Human Feedback (RLHF). We first identify the primary challenges of existing popular …

Regularizing hidden states enables learning generalizable reward model for LLMs

R Yang, R Ding, Y Lin, H Zhang… - Advances in Neural …, 2025 - proceedings.neurips.cc
Reward models trained on human preference data have been proven to effectively align
Large Language Models (LLMs) with human intent within the framework of reinforcement …

Interpretable preferences via multi-objective reward modeling and mixture-of-experts

H Wang, W Xiong, T Xie, H Zhao, T Zhang - arXiv preprint arXiv …, 2024 - arxiv.org
Reinforcement learning from human feedback (RLHF) has emerged as the primary method
for aligning large language models (LLMs) with human preferences. The RLHF process …
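
As a rough sketch of the framing this title suggests (the symbols below are illustrative assumptions, not notation from the entry): a multi-objective reward model predicts a vector of per-attribute rewards, and a prompt-conditioned gating network mixes them into the scalar used for alignment:

```latex
% k interpretable reward dimensions (e.g., helpfulness, correctness, verbosity)
\mathbf{r}_\theta(x, y) \in \mathbb{R}^{k}

% gating weights on the probability simplex, conditioned on the prompt
\mathbf{g}_\phi(x) \in \Delta^{k-1}

% final scalar reward as a gated mixture of the objectives
r(x, y) = \mathbf{g}_\phi(x)^{\top}\, \mathbf{r}_\theta(x, y)
```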

Aligning to thousands of preferences via system message generalization

S Lee, SH Park, S Kim, M Seo - Advances in Neural …, 2025 - proceedings.neurips.cc
Although humans inherently have diverse values, current large language model (LLM)
alignment methods often assume that aligning LLMs with the general public's preferences is …

DPO meets PPO: Reinforced token optimization for RLHF

H Zhong, G Feng, W Xiong, X Cheng, L Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
In the classical Reinforcement Learning from Human Feedback (RLHF) framework, Proximal
Policy Optimization (PPO) is employed to learn from sparse, sentence-level rewards--a …
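
A hedged sketch of the dense-reward idea the title points at, under the assumption that token-level rewards are read off a DPO-style log-ratio (notation is illustrative: s_t is the prefix, a_t the next token):

```latex
% token-level reward derived from a DPO-trained policy relative to a reference,
% then usable as a dense signal inside PPO-style training (sketch, not the paper's exact form)
r_t(s_t, a_t) = \beta \log \frac{\pi_{\mathrm{DPO}}(a_t \mid s_t)}{\pi_{\mathrm{ref}}(a_t \mid s_t)}
```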

LiPO: Listwise preference optimization through learning-to-rank

T Liu, Z Qin, J Wu, J Shen, M Khalman, R Joshi… - arXiv preprint arXiv …, 2024 - arxiv.org
Aligning language models (LMs) with curated human feedback is critical to control their
behaviors in real-world applications. Several recent policy optimization methods, such as …

Weak-to-strong extrapolation expedites alignment

C Zheng, Z Wang, H Ji, M Huang, N Peng - arXiv preprint arXiv …, 2024 - arxiv.org
The open-source community is experiencing a surge in the release of large language
models (LLMs) that are trained to follow instructions and align with human preference …
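
A minimal Python sketch of the weight-extrapolation idea suggested by this title, assuming two checkpoints of the same architecture (a weaker SFT model and a stronger aligned model); the function name, the state-dict interface, and the value of alpha are illustrative assumptions, not details from the paper text:

```python
# Sketch: extrapolate past an aligned checkpoint along the SFT -> aligned direction.
# expo = aligned + alpha * (aligned - sft)  ==  (1 + alpha) * aligned - alpha * sft
import torch


def extrapolate_weights(sft_state: dict, aligned_state: dict, alpha: float = 0.3) -> dict:
    """Blend two state dicts by moving beyond the aligned weights."""
    expo_state = {}
    for name, w_aligned in aligned_state.items():
        w_sft = sft_state[name]
        expo_state[name] = w_aligned + alpha * (w_aligned - w_sft)
    return expo_state


# Usage (illustrative paths):
# sft = torch.load("sft_model.pt")          # weaker, instruction-tuned checkpoint
# aligned = torch.load("aligned_model.pt")  # stronger, preference-aligned checkpoint
# torch.save(extrapolate_weights(sft, aligned, alpha=0.3), "expo_model.pt")
```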

Exploratory preference optimization: Harnessing implicit Q*-approximation for sample-efficient RLHF

T Xie, DJ Foster, A Krishnamurthy, C Rosset… - arXiv preprint arXiv …, 2024 - arxiv.org
Reinforcement learning from human feedback (RLHF) has emerged as a central tool for
language model alignment. We consider online exploration in RLHF, which exploits …