Strengthening multimodal large language model with bootstrapped preference optimization
Multimodal Large Language Models (MLLMs) excel in generating responses based
on visual inputs. However, they often suffer from a bias towards generating responses …
VideoScore: Building automatic metrics to simulate fine-grained human feedback for video generation
Recent years have witnessed great advances in video generation. However, the
development of automatic video metrics lags significantly behind. None of the existing …
DPO meets PPO: Reinforced token optimization for RLHF
In the classical Reinforcement Learning from Human Feedback (RLHF) framework, Proximal
Policy Optimization (PPO) is employed to learn from sparse, sentence-level rewards--a …
Skywork-Reward: Bag of tricks for reward modeling in LLMs
In this report, we introduce a collection of methods to enhance reward modeling for LLMs,
focusing specifically on data-centric techniques. We propose effective data selection and …
RLHF workflow: From reward modeling to online RLHF
In this technical report, we present the workflow of Online Iterative Reinforcement Learning
from Human Feedback (RLHF), which is widely reported to outperform its offline counterpart …
Prometheus 2: An open source language model specialized in evaluating other language models
Proprietary LMs such as GPT-4 are often employed to assess the quality of responses from
various LMs. However, concerns including transparency, controllability, and affordability …
Decoding-time language model alignment with multiple objectives
Aligning language models (LMs) to human preferences has emerged as a critical pursuit,
enabling these models to better serve diverse user needs. Existing methods primarily focus …
Conditional Language Policy: A General Framework for Steerable Multi-Objective Finetuning
Reward-based finetuning is crucial for aligning language policies with intended behaviors
(e.g., creativity and safety). A key challenge is to develop steerable language models that …
LLM-as-a-Judge & reward model: What they can and cannot do
LLM-as-a-Judge and reward models are widely used alternatives to multiple-choice
questions or human annotators for large language model (LLM) evaluation. Their efficacy …
Bi-factorial preference optimization: Balancing safety-helpfulness in language models
Fine-tuning large language models (LLMs) on human preferences, typically through
reinforcement learning from human feedback (RLHF), has proven successful in enhancing …