Strengthening multimodal large language model with bootstrapped preference optimization

R Pi, T Han, W Xiong, J Zhang, R Liu, R Pan… - … on Computer Vision, 2024 - Springer
Multimodal Large Language Models (MLLMs) excel in generating responses based
on visual inputs. However, they often suffer from a bias towards generating responses …

VideoScore: Building automatic metrics to simulate fine-grained human feedback for video generation

X He, D Jiang, G Zhang, M Ku, A Soni, S Siu… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent years have witnessed great advances in video generation. However, the
development of automatic video metrics is lagging significantly behind. None of the existing …

DPO meets PPO: Reinforced token optimization for RLHF

H Zhong, G Feng, W Xiong, X Cheng, L Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
In the classical Reinforcement Learning from Human Feedback (RLHF) framework, Proximal
Policy Optimization (PPO) is employed to learn from sparse, sentence-level rewards--a …

Skywork-Reward: Bag of tricks for reward modeling in LLMs

CY Liu, L Zeng, J Liu, R Yan, J He, C Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
In this report, we introduce a collection of methods to enhance reward modeling for LLMs,
focusing specifically on data-centric techniques. We propose effective data selection and …

RLHF workflow: From reward modeling to online RLHF

H Dong, W Xiong, B Pang, H Wang, H Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
In this technical report, we present the workflow of Online Iterative Reinforcement Learning from Human Feedback
(RLHF), which is widely reported to outperform its offline counterpart …

Prometheus 2: An open source language model specialized in evaluating other language models

S Kim, J Suk, S Longpre, BY Lin, J Shin… - arXiv preprint arXiv …, 2024 - arxiv.org
Proprietary LMs such as GPT-4 are often employed to assess the quality of responses from
various LMs. However, concerns including transparency, controllability, and affordability …

Decoding-time language model alignment with multiple objectives

R Shi, Y Chen, Y Hu, A Liu, H Hajishirzi… - arXiv preprint arXiv …, 2024 - arxiv.org
Aligning language models (LMs) to human preferences has emerged as a critical pursuit,
enabling these models to better serve diverse user needs. Existing methods primarily focus …

Conditional Language Policy: A General Framework for Steerable Multi-Objective Finetuning

K Wang, R Kidambi, R Sullivan, A Agarwal… - arXiv preprint arXiv …, 2024 - arxiv.org
Reward-based finetuning is crucial for aligning language policies with intended behaviors
(e.g., creativity and safety). A key challenge is to develop steerable language models that …

LLM-as-a-Judge & reward model: What they can and cannot do

G Son, H Ko, H Lee, Y Kim, S Hong - arXiv preprint arXiv:2409.11239, 2024 - arxiv.org
LLM-as-a-Judge and reward models are widely used alternatives to multiple-choice
questions or human annotators for large language model (LLM) evaluation. Their efficacy …

Bi-factorial preference optimization: Balancing safety-helpfulness in language models

W Zhang, PHS Torr, M Elhoseiny, A Bibi - arXiv preprint arXiv:2408.15313, 2024 - arxiv.org
Fine-tuning large language models (LLMs) on human preferences, typically through
reinforcement learning from human feedback (RLHF), has proven successful in enhancing …