Cambrian-1: A fully open, vision-centric exploration of multimodal LLMs
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach. While stronger language models can enhance multimodal capabilities, the …
SimPO: Simple preference optimization with a reference-free reward
Direct Preference Optimization (DPO) is a widely used offline preference optimization algorithm that reparameterizes reward functions in reinforcement learning from …
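Since this snippet contrasts SimPO's reference-free reward with DPO's reference-based one, a minimal PyTorch sketch of the two losses may help. The summed-log-prob inputs and the beta/gamma defaults are illustrative assumptions, not the papers' tuned settings.

```python
import torch.nn.functional as F

def dpo_loss(pi_logps_w, pi_logps_l, ref_logps_w, ref_logps_l, beta=0.1):
    # DPO's implicit reward is beta * log(pi_theta / pi_ref) over the whole
    # response; the loss is a Bradley-Terry objective on chosen (w) vs. rejected (l).
    logits = beta * ((pi_logps_w - ref_logps_w) - (pi_logps_l - ref_logps_l))
    return -F.logsigmoid(logits).mean()

def simpo_loss(pi_logps_w, pi_logps_l, len_w, len_l, beta=2.0, gamma=0.5):
    # SimPO drops the reference model: the reward is the length-normalized
    # log-likelihood, and a target reward margin gamma separates the pair.
    logits = beta * (pi_logps_w / len_w - pi_logps_l / len_l) - gamma
    return -F.logsigmoid(logits).mean()
```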
Iterative preference learning from human feedback: Bridging theory and practice for RLHF under KL-constraint
This paper studies the alignment process of generative models with Reinforcement Learning from Human Feedback (RLHF). We first identify the primary challenges of existing popular …
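The KL-constrained objective this line of work analyzes is the standard RLHF formulation: maximize expected reward while penalizing divergence from a reference policy, which admits a Gibbs-style closed-form optimum. In LaTeX, with notation assumed from common usage rather than this paper's exact symbols:

```latex
\max_{\pi}\;
  \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\big[r(x, y)\big]
  \;-\; \beta\, \mathrm{KL}\!\big(\pi(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big),
\qquad
\pi^{*}(y \mid x) \;\propto\; \pi_{\mathrm{ref}}(y \mid x)\, e^{\,r(x, y)/\beta}.
```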
Regularizing hidden states enables learning generalizable reward model for LLMs
Reward models trained on human preference data have been proven to effectively align Large Language Models (LLMs) with human intent within the framework of reinforcement …
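A heavily hedged sketch of the general recipe the title suggests: combine the usual Bradley-Terry preference loss with a text-generation loss that regularizes the hidden states shared with the LM head. The function name, the lam coefficient, and the exact form of the regularizer are assumptions for illustration, not the paper's specification.

```python
import torch.nn.functional as F

def regularized_bt_loss(chosen_r, rejected_r, lm_loss, lam=0.01):
    # Bradley-Terry preference loss on the reward head, plus a language-model
    # loss that keeps the shared hidden states useful for text generation
    # (the hypothesized regularizer; lam is an illustrative weight).
    return -F.logsigmoid(chosen_r - rejected_r).mean() + lam * lm_loss
```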
Interpretable preferences via multi-objective reward modeling and mixture-of-experts
Reinforcement learning from human feedback (RLHF) has emerged as the primary method for aligning large language models (LLMs) with human preferences. The RLHF process …
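One plausible reading of "multi-objective reward modeling and mixture-of-experts" as an architecture sketch: per-objective linear heads whose scores stay inspectable, mixed into a scalar reward by a learned gate. Layer sizes, the number of objectives, and the softmax gate are illustrative assumptions, not the paper's exact design.

```python
import torch.nn as nn

class MultiObjectiveRewardHead(nn.Module):
    # Regress several interpretable objectives (helpfulness, safety, ...) from
    # the LLM's final hidden state, then mix them with a learned gating network.
    def __init__(self, hidden_size=4096, n_objectives=5):
        super().__init__()
        self.objectives = nn.Linear(hidden_size, n_objectives)
        self.gate = nn.Sequential(nn.Linear(hidden_size, n_objectives),
                                  nn.Softmax(dim=-1))

    def forward(self, h):            # h: (batch, hidden_size)
        scores = self.objectives(h)  # per-objective rewards, kept inspectable
        weights = self.gate(h)       # non-negative mixture weights, sum to 1
        return (weights * scores).sum(dim=-1)  # scalar reward per example
```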
Aligning to thousands of preferences via system message generalization
Although humans inherently have diverse values, current large language model (LLM) alignment methods often assume that aligning LLMs with the general public's preferences is …
DPO meets PPO: Reinforced token optimization for RLHF
In the classical Reinforcement Learning from Human Feedback (RLHF) framework, Proximal Policy Optimization (PPO) is employed to learn from sparse, sentence-level rewards--a …
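The sentence-level-vs-token-level contrast here is easy to make concrete: a DPO-style policy/reference log-ratio yields a dense per-token reward that a PPO trainer can consume. This is a sketch of that general idea under assumed shapes, not the paper's exact reward construction.

```python
def token_level_rewards(pi_token_logps, ref_token_logps, beta=0.1):
    # Dense per-token reward as the scaled policy/reference log-ratio, usable
    # by PPO in place of a single sparse sentence-level reward.
    # Shapes: (batch, seq_len) log-probs of the sampled tokens; beta assumed.
    return beta * (pi_token_logps - ref_token_logps)
```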
LiPO: Listwise preference optimization through learning-to-rank
Aligning language models (LMs) with curated human feedback is critical to control their behaviors in real-world applications. Several recent policy optimization methods, such as …
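Since LiPO casts preference optimization as listwise learning-to-rank, one standard listwise objective (a ListNet-style softmax cross-entropy) is a useful reference point; it is a simple member of the family the paper studies, not its proposed LiPO-λ loss.

```python
import torch

def listwise_softmax_loss(scores, labels):
    # ListNet-style objective: align the model's softmax distribution over K
    # candidate responses with the distribution implied by graded labels.
    # scores, labels: (batch, K) tensors; higher label = more preferred.
    log_p = torch.log_softmax(scores, dim=-1)
    target = torch.softmax(labels, dim=-1)
    return -(target * log_p).sum(dim=-1).mean()
```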
Weak-to-strong extrapolation expedites alignment
The open-source community is experiencing a surge in the release of large language models (LLMs) that are trained to follow instructions and align with human preference …
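The extrapolation trick itself is simple to state in code: given a weaker checkpoint (e.g., the SFT model) and a stronger aligned one, step past the stronger model along the difference of their weights. The function name and alpha value are illustrative; the two state dicts are assumed to share keys.

```python
def extrapolate_weights(weak_sd, strong_sd, alpha=0.3):
    # Move *past* the aligned (strong) checkpoint along the weak-to-strong
    # direction: alpha = 0 recovers the strong model, alpha > 0 extrapolates
    # beyond it, amplifying the alignment update.
    return {k: strong_sd[k] + alpha * (strong_sd[k] - weak_sd[k])
            for k in strong_sd}
```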
Exploratory preference optimization: Harnessing implicit Q*-approximation for sample-efficient RLHF
Reinforcement learning from human feedback (RLHF) has emerged as a central tool for language model alignment. We consider online exploration in RLHF, which exploits …